Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'.

The Distant Reader harvested & cached your content into a
collection/corpus. It then applied sets of natural language
processing and text mining against the collection. The results of
this process were reduced to a database file -- a 'study carrel'.
The study carrel can then be queried, thus bringing to light
specific characteristics of your collection. These
characteristics can help you summarize the collection as well as
enumerate things you might want to investigate more closely.
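
The report does not say which database engine or schema the study
carrel uses. As a minimal sketch, assuming the carrel is an SQLite
file (an assumption, not something stated here), a first query could
simply list the tables it contains. The table name `example_items`
below is hypothetical and only stands in for a real carrel.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# An in-memory database stands in for a real carrel file here; to inspect
# an actual file, pass its path to sqlite3.connect() instead.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE example_items (id TEXT, words INTEGER)")  # hypothetical table
print(list_tables(demo))
```

Listing the tables first avoids guessing at column names before the
real schema is known.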

This is a terse narrative report; when processing is complete,
you will be linked to a more complete narrative report.

                               Eric Lease Morgan <emorgan@nd.edu>


Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
118


Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
9974


Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
51
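
The report does not name the readability measure it uses. The 0-to-100
scale (higher = easier) matches the Flesch Reading Ease score, so the
sketch below assumes that formula; the function name and the counts
passed to it are illustrative only.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text.

    Assumes this is the measure behind the report's score; the report
    itself does not say so.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Illustrative counts only, not taken from this corpus.
print(round(flesch_reading_ease(words=100, sentences=5, syllables=150), 3))
```

Longer sentences and more syllables per word both lower the score,
which is why dense scientific prose tends to land near the middle of
the scale.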


Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
119	sequence
26	RNA
24	dna
21	protein
13	virus
11	structure
11	SARS
10	genome
8	gene
8	Fig
7	model
6	PCR
5	study
5	figure
5	cell
5	acid
4	viral
4	result
4	peptide
4	human
4	high
4	disease
4	bind
4	activity
4	University
4	NMR
3	receptor
3	plant
3	method
3	interaction
3	Table
3	NCBI
3	India
3	CNN
3	CMV
2	vaccine
2	site
2	sequencing
2	residue
2	read
2	probe
2	phylogenetic
2	mutation
2	metagenomic
2	isolate
2	information
2	function
2	feature
2	enzyme
2	domain
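
The frequency lists in this report share one layout: a count, a
separator, and a term. As a minimal sketch, assuming the separator is
a single tab character, the lists can be parsed for further analysis
as follows. The helper name `parse_counts` is hypothetical.

```python
from typing import Iterable


def parse_counts(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Parse 'count<TAB>term' lines into (term, count) pairs.

    Lines that do not match the layout (headings, rules, blanks) are skipped.
    """
    pairs = []
    for line in lines:
        count, separator, term = line.rstrip("\n").partition("\t")
        if not separator or not count.strip().isdigit():
            continue
        pairs.append((term.strip(), int(count)))
    return pairs


# The first three keyword rows from the list above, plus two lines to skip.
sample = ["119\tsequence", "26\tRNA", "24\tdna", "-----", ""]
keywords = parse_counts(sample)
total = sum(count for _, count in keywords)
for term, count in keywords:
    print(f"{term:10s} {count:4d} {count / total:6.1%}")
```

Because the count comes first on each line, `partition` splits at the
first tab only, so terms that themselves contain tabs or spaces stay
intact.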


Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
11814	sequence
8325	protein
3842	virus
3361	structure
3331	peptide
2943	cell
2929	gene
2782	genome
2659	method
2476	acid
2212	analysis
1942	%
1936	study
1908	model
1865	datum
1807	number
1726	dna
1707	result
1659	region
1503	activity
1464	amino
1379	alignment
1332	domain
1322	residue
1316	site
1312	group
1287	type
1278	interaction
1277	function
1257	approach
1255	database
1253	disease
1233	time
1179	receptor
1102	sample
1044	information
992	set
981	system
974	value
964	family
948	mutation
941	sequencing
924	length
917	similarity
905	dataset
904	motif
898	membrane
882	c
877	tree
858	specie


Top 50 proper nouns; "What are the names of persons or places?"
--------------------------------------------------------------
2113	al
1831	et
1615	.
1463	RNA
808	C
610	Fig
580	SARS
515	PCR
446	Table
393	A
363	N
362	k
360	DNA
359	T
350	S
337	Genome
333	University
328	NMR
324	II
280	DeepRC
273	M
269	±
264	NCBI
258	Protein
251	CoV-2
246	HCV
242	B
225	L
219	fl
219	HIV
218	K
216	E.
213	j
209	D
207	de
197	GenBank
191	Virus
184	India
170	Human
169	bp
168	LSTM
164	Institute
164	F
164	CNN
160	RT
156	China
154	MS
151	Gly
150	G
150	C.


Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
6055	we
2912	it
1084	they
873	i
445	them
206	us
138	one
93	he
84	itself
44	themselves
19	you
12	him
11	she
9	me
6	ourselves
4	yÞ
3	l1oc
3	himself
3	her
2	ppifs
2	p450s
2	n40np
2	mine
2	ifnyr-/-mice
2	https://github.com/ababaian/serratus
2	em
2	cb562
1	³hser
1	yegfp
1	y_~
1	y401
1	y
1	w@
1	u
1	tlg1
1	sod-3::gfp
1	s
1	pgem2dhfr
1	p110a
1	ours
1	n−3
1	nthash
1	myself
1	iv-3l3r.
1	insl3
1	icmv1
1	https://serratus.io
1	hc-201
1	hbs06
1	fbp17


Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
39197	be
7054	have
5776	use
2586	show
2223	base
1559	bind
1556	find
1419	contain
1399	identify
1296	include
1118	provide
1013	do
982	know
975	obtain
922	represent
884	determine
873	compare
842	suggest
840	give
840	generate
829	develop
781	indicate
755	increase
728	follow
704	perform
702	describe
700	see
699	predict
695	allow
694	involve
686	make
663	reveal
655	lead
648	associate
646	form
637	study
624	consider
617	observe
606	report
598	detect
597	result
596	produce
567	require
566	propose
565	relate
542	express
534	induce
530	characterize
528	isolate
526	cause


Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
2727	not
2245	also
2032	different
1976	high
1970	other
1855	-
1715	viral
1688	more
1678	such
1568	new
1509	human
1313	only
1306	well
1233	most
1165	large
1122	specific
1093	molecular
1091	first
1083	however
1070	structural
919	many
876	small
844	then
833	similar
833	low
831	important
814	single
811	several
794	biological
746	multiple
743	as
738	novel
733	non
727	same
727	immune
726	thus
660	long
656	highly
656	functional
652	possible
649	therefore
634	available
630	very
626	nucleotide
616	further
604	various
600	present
585	genetic
579	genomic
574	phylogenetic


Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------
394	most
232	good
189	least
146	high
99	Most
85	large
47	small
45	near
38	low
35	short
32	close
31	long
29	late
20	strong
17	great
16	early
15	simple
10	bad
6	fast
6	big
4	easy
4	-peptides
3	weak
3	old
2	®
2	wide
2	thick
2	new
2	hot
2	clear
2	-which
2	-methylated
2	-hybrid
1	~15
1	young
1	tight
1	slow
1	slim
1	slight
1	setcov
1	rugosa
1	quick
1	preS1
1	poor
1	poly(U
1	pdqu
1	molossid
1	mean:-42
1	loose
1	little


Top 50 lemmatized superlative adverbs; "How do things do to the extreme?"
------------------------------------------------------------------------
839	most
117	least
29	well
3	shortest
3	long
3	highest
3	clustalw
2	worst
1	~3
1	smallest
1	near
1	fast


Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
23	github.com
22	www.ncbi.nlm.nih.gov
10	www
7	serratus.io
6	doi.org
4	www.niaid.nih.gov
4	www.ncbi
3	www.ncbi.nlm.nih
3	www.mdpi.com
3	www.gisaid.org
3	www.ebi.ac.uk
3	image.thelancet.com
2	www.who.int
2	www.sternadi.com
2	www.rcsb.org
2	www.broadinstitute.org
2	tree.bio.ed.ac.uk
2	pave.niaid.nih.gov
2	opensource.googleblog.com
2	mol.ax
2	evolution.genetics.washington.edu
2	earthmicrobiome.org
2	creativecommons.org
2	covdb.microbiology.hku.hk
2	compbio.dfci.harvard.edu
2	clients.adaptivebiotech.com
2	blast.ncbi.nlm.nih.gov
2	alla.cs.gsu.edu
1	xmtb
1	wwwmg
1	www3.niaid.nih.gov
1	www.wheatgenome.org
1	www.virology.wisc.edu
1	www.vetmed.ucdavis.edu
1	www.uniprot.org
1	www.spss.com
1	www.sanger.ac.uk
1	www.rostlab.org
1	www.ridom.com
1	www.predictprotein.org
1	www.picb.ac.cn
1	www.phred.org
1	www.pdb.org
1	www.oxfordjournals.org
1	www.ostp
1	www.microsoft.com
1	www.mg-rast.org
1	www.megasoft
1	www.kazusa.or.jp
1	www.jcvi.org


Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
10	http://www
4	http://www.ncbi
4	http://github.com/ml-jku/DeepRC
3	http://www.ncbi.nlm.nih.gov/
3	http://www.ncbi.nlm.nih
3	http://www.gisaid.org/
3	http://serratus.io/access
3	http://serratus.io
2	http://www.sternadi.com/phyvirus
2	http://www.niaid.nih.gov/dmid/genomes/
2	http://www.ncbi.nlm.nih.gov/genome/viruses/variation/
2	http://www.ncbi.nlm.nih.gov/genome/
2	http://pave.niaid.nih.gov/
2	http://github.com/spro/practical-pytorch
2	http://github.com/serratus-bio/tantalus
2	http://github.com/ababaian/serratus
2	http://github.com/
2	http://earthmicrobiome.org/protocols-and-standards/16s/
2	http://covdb.microbiology.hku.hk
2	http://clients.adaptivebiotech.com/pub/Emerson-2017-NatGen
2	http://blast.ncbi.nlm.nih.gov/Blast.cgi
2	http://alla.cs.gsu.edu/~software/VISPA/vispa.html
1	http://xmtb
1	http://wwwmg
1	http://www3.niaid.nih.gov/research/topics/
1	http://www.who.int/tdr
1	http://www.who.int/mediacentre/
1	http://www.wheatgenome.org/
1	http://www.virology.wisc.edu/acp/Aligns/seq_align.html
1	http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm
1	http://www.uniprot.org/
1	http://www.spss.com/
1	http://www.sanger.ac.uk/Projects/
1	http://www.rostlab.org/
1	http://www.ridom.com/seqsphere/
1	http://www.rcsb.org/structure/6M2N
1	http://www.rcsb.org/pdb/
1	http://www.predictprotein.org/
1	http://www.picb.ac.cn/
1	http://www.phred.org
1	http://www.pdb.org/
1	http://www.oxfordjournals.org/nar/database/
1	http://www.ostp
1	http://www.niaid.nih.gov/dmid/genomes/mscs/
1	http://www.niaid.nih.gov/dmid/
1	http://www.ncbi.nlm.nih.gov/sutils/pasc
1	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394299/
1	http://www.ncbi.nlm.nih.gov/nuccore/
1	http://www.ncbi.nlm.nih.gov/gorf/gorf.html
1	http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses


Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
2	journals.permissions@oup.com
1	ytliu@ucsd.edu
1	ncbi-help@ncbi.nlm.nih.gov
1	lichwun@163.com
1	krishna.bhattiprolu@uni-graz.at
1	ihh@berkeley.edu
1	baydin2@cs.gsu.edu
1	mara.kozic@liverpool.ac.uk


Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
-------------------------------------------------------------------------------
11	results are consistent
10	sequence is not
10	sequences are not
8	sequences are similar
7	data are available
7	method does not
7	sequences are also
7	sequences are often
7	sequences do not
7	sequences were then
6	methods do not
6	sequence was not
6	sequences were not
5	data are not
5	domains do not
5	methods are mostly
5	protein is not
5	proteins do not
5	sequence does not
5	sequence is present
5	sequence was also
5	sequences are then
5	sequences were available
4	method is deeprc
4	model is able
4	models using integrated
4	peptides were also
4	protein binding sites
4	proteins are very
4	sequences are available
4	sequences have significantly
4	sequences indicate higher
4	sequences were randomly
4	viruses are not
3	activity is also
3	activity was not
3	alignments were manually
3	data do not
3	gene does not
3	gene finding hmm
3	gene is also
3	gene is not
3	genes containing introns
3	genomes do not
3	group is present
3	groups has consistently
3	method is effective
3	method is highly
3	method is very
3	method performs well


Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
5	sequences have no mutation
2	studies are not homologous
1	acids are not common
1	acids are not yet
1	alignment is not trivial
1	alignments is not practical
1	analysis is not unique
1	analysis showed no segregation
1	cells are not only
1	cells are not well
1	data are not complete
1	data are not directly
1	data are not forthcoming
1	data are not readily
1	data are not suitable
1	data do not necessarily
1	data showed no evidence
1	data showed no significant
1	data were not correctly
1	dna was not certain
1	domain is not present
1	domain is not yet
1	domains do not appreciably
1	gene is not essential
1	gene is not lethal
1	genes are not identical
1	method is not generally
1	method is not practical
1	methods are not suffi
1	methods do not always
1	methods do not easily
1	models are not sufficiently
1	models have not yet
1	models were not able
1	number are not necessarily
1	number is not always
1	number is not linear
1	peptide is not effi
1	peptide showed no hemolytic
1	peptides are not ideal
1	peptides are not stable
1	peptides do no exhibit
1	peptides have no structure
1	peptides show no pressor
1	peptides was not easy
1	protein has no role
1	protein is not catalytically
1	protein is not entirely
1	protein is not solely
1	proteins are not strictly


A rudimentary bibliography
--------------------------
      id = cord-274056-9t3kneoo
  author = Abd Elwahaab, Marwa A.
   title = A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
    date = 2019-05-08
keywords = protein; sequence
 summary = For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1. The similarity/dissimilarity vectors corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2. The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4. The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike) is illustrated in Table 14 and shown in Figure 6.
     doi = 10.1155/2019/8702968

      id = cord-279528-41atidai
  author = Abo-Elkhier, Mervat M.
   title = Measuring Similarity among Protein Sequences Using a New Descriptor
    date = 2019-11-22
keywords = Table; sequence
 summary = Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A_t, SA_t). The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in ... The 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b), respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A_t and standard deviation SA_t of the combined intensity level value A_t(i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
     doi = 10.1155/2019/2796971

      id = cord-287634-64zqe4cz
  author = Al-Ssulami, Abdulrakeeb M.
   title = CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents
    date = 2020-01-31
keywords = sequence
 summary = For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10], to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents
     doi = 10.1016/j.ygeno.2019.02.002

      id = cord-102766-n6mpdhyu
  author = Alam, Md. Nafis Ul
   title = Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses
    date = 2020-06-25
keywords = RNA; feature; sequence
 summary = Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data.
     doi = 10.1101/2020.06.25.170779

      id = cord-018133-2otxft31
  author = Altman, Russ B.
   title = Bioinformatics
    date = 2006
keywords = datum; dna; information; sequence; structure
 summary = Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biotechnology Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources.
     doi = 10.1007/0-387-36278-9_22

      id = cord-010260-8lnpujip
  author = Anthonsen, Henrik W.
   title = The blind watchmaker and rational protein engineering
    date = 1994-08-31
keywords = Fig; NMR; electrostatic; method; protein; sequence; structure
 summary = 
     doi = 10.1016/0168-1656(94)90152-x

      id = cord-000473-jpow6iw1
  author = Astrovskaya, Irina
   title = Inferring viral quasispecies spectra from 454 pyrosequencing reads
    date = 2011-07-28
keywords = HCV; read; sequence
 summary = High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population.
     doi = 10.1186/1471-2105-12-s6-s1

      id = cord-035033-osjy88rc
  author = Aydin, Berkay
   title = Spatiotemporal event sequence discovery without thresholds
    date = 2020-11-09
keywords = ESMINER; RAND; event; sequence
 summary = Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which will take a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials.
     doi = 10.1007/s10707-020-00427-6

      id = cord-000257-ampip7od
  author = Bagowski, Christoph P
   title = The Nature of Protein Domain Evolution: Shaping the Interaction Network
    date = 2010-08-17
keywords = domain; evolution; protein; sequence
 summary = With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig.
     doi = 10.2174/138920210791616725

      id = cord-003316-r5te5xob
  author = Balloux, Francois
   title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
    date = 2018-12-17
keywords = AMR; WGS; clinical; genome; sequence; sequencing
 summary = WGS-based strain identification gives a far superior resolution. In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
     doi = 10.1016/j.tim.2018.08.004

      id = cord-291156-zxg3dsm3
  author = Bernasconi, Anna
   title = Empowering Virus Sequences Research through Conceptual Modeling
    date = 2020-05-01
keywords = SARS; VCM; sequence; virus
 summary = 
     doi = 10.1101/2020.04.29.067637

      id = cord-304869-l6a68tqn
  author = Bielińska-Wąż, Dorota
   title = Graphical and numerical representations of DNA sequences: statistical aspects of similarity
    date = 2011-08-28
keywords = Fig; Table; dna; sequence
 summary = As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences so it can be used for single nucleotide polymorphism (SNP) analyses, which is the subject of many investigations, as for example, in a recent work by Bhasi et al.
     doi = 10.1007/s10910-011-9890-8

      id = cord-310734-6v7oru2l
  author = Bolatti, Elisa M.
   title = A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses
    date = 2020-04-09
keywords = Genomoviridae; Rep; bat; dna; sequence
 summary = 
     doi = 10.3390/v12040422

      id = cord-334127-wjf8t8vp
  author = Brister, J. Rodney
   title = NCBI Viral Genomes Resource
    date = 2015-01-28
keywords = NCBI; Viral; sequence
 summary = This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11-13), sequence assembly and virus discovery (14-17), viral dynamics and evolution (18-20) and pathogen detection (14, 21-23). The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated "viral host" property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51).
     doi = 10.1093/nar/gku1207

      id = cord-203232-1nnqx1g9
  author = Canturk, Semih
   title = Machine-Learning Driven Drug Repurposing for COVID-19
    date = 2020-06-25
keywords = SARS; drug; sequence; virus
 summary = Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence.
     doi = nan

      id = cord-328644-odtue60a
  author = Comandatore, Francesco
   title = Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes
    date = 2020-05-28
keywords = Coronavirus; SARS; sequence; variant
 summary = These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses, as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016), which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains with respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April 10, 2020 (Figure 1).
     doi = 10.1101/2020.04.30.071027

      id = cord-268549-2lg8i9r1
  author = Dai, Qi
   title = Sequence comparison via polar coordinates representation and curve tree
    date = 2012-01-07
keywords = Randic; dna; sequence
 summary = It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation
     doi = 10.1016/j.jtbi.2011.09.030

      id = cord-002473-2kpxhzbe
  author = Das, Jayanta Kumar
   title = Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach
    date = 2017-03-31
keywords = acid; sequence
 summary = Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups.
     doi = 10.1371/journal.pone.0175031

      id = cord-004862-yv76yvy5
  author = Demers, G. William
   title = The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin
    date = 1989
keywords = Fig; ORF-1; sequence
 summary = The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species.
     doi = 10.1007/bf02106177

      id = cord-339915-8j04y50s
  author = Deng, Wei
   title = DV-Curve Representation of Protein Sequences and Its Application
    date = 2014-05-08
keywords = dna; sequence
 summary = Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis among coronaviruses based on their spike proteins. In this paper, we introduce the DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids.
     doi = 10.1155/2014/203871

      id = cord-255194-4i9fc0r7
  author = Djikeng, Appolinaire
   title = Viral genome sequencing by random priming methods
    date = 2008-01-07
keywords = SISPA; coverage; sequence
 summary = An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3' end (Figure 2). Additionally, in order to capture 5' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the Klenow reaction (Figure 2 shows a 5' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10^6 viral particles per RT-PCR reaction (Figure 5). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives.
     doi = 10.1186/1471-2164-9-5

      id = cord-266288-buc4dd5y
  author = Dong, Rui
   title = A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance
    date = 2019-04-09
keywords = ANV; Accumulated; sequence
 summary = Classification of DNA sequences is an important issue in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ^18. The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016); however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in ℝ^18, where the additional six dimensions describe the covariance between nucleotides.
     doi = 10.3389/fgene.2019.00234
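
The 12 baseline dimensions mentioned in the summary above (number, average position and dispersion of positions for each nucleotide) can be sketched as follows. This is a hedged illustration of the plain natural vector only, not of the ANV method: the six covariance dimensions are not defined in the summary and are omitted, and the function name and the dispersion normalization (dividing by count times sequence length) are assumptions.

```python
def natural_vector(sequence, alphabet="ACGT"):
    """Return counts, mean positions and position dispersions (4 + 4 + 4).

    Positions are 1-based. The dispersion normalization used here
    (sum of squared deviations / (count * sequence length)) is an assumption.
    """
    length = len(sequence)
    counts, means, dispersions = [], [], []
    for base in alphabet:
        positions = [i for i, ch in enumerate(sequence, start=1) if ch == base]
        count = len(positions)
        if count == 0:
            # A base that never occurs contributes zeros in every component.
            mean, dispersion = 0.0, 0.0
        else:
            mean = sum(positions) / count
            dispersion = sum((p - mean) ** 2 for p in positions) / (count * length)
        counts.append(count)
        means.append(mean)
        dispersions.append(dispersion)
    return counts + means + dispersions


print(natural_vector("ACGT"))
```

Because every sequence maps to a fixed-length numeric vector, sequences of different lengths can be compared with an ordinary vector distance, without alignment.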

      id = cord-033010-o5kiadfm
  author = Durojaye, Olanrewaju Ayodeji
   title = Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study
    date = 2020-10-02
keywords = Fig; SARS; model; protein; sequence; structure
 summary = RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome and translated, and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, occupying 45% and 47% of the total residues, respectively, with the percentage loop occupancy at 8%. Homology modeling, also regarded as comparative modeling, constructs atomic models based on known structures, or structures that have been determined experimentally, that share more than 40% sequence homology.
     doi = 10.1186/s43042-020-00081-5

      id = cord-001786-ybd8hi8y
  author = Dutilh, Bas E
   title = Metagenomic ventures into outer sequence space
    date = 2014-12-15
keywords = sequence; unknown
 summary = These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality [2], for describing microbial ecosystems and tracking their variation [3], in untargeted medical diagnostics and forensics [4], and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database.
     doi = 10.4161/21597081.2014.979664

      id = cord-334394-qgyzk7th
  author = Edgar, Robert C.
   title = Petabase-scale sequence alignment catalyses viral discovery
    date = 2020-08-10
keywords = Extended; Figure; RNA; SRA; Serratus; genome; sequence
 summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for the Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5]) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5].
     doi = 10.1101/2020.08.07.241729

      id = cord-011565-8ncgldaq
  author = Elworth, R A Leo
   title = To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
    date = 2020-06-04
keywords = Bloom; CMS; hash; sequence; set
 summary = For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7), Count-Min Sketch (CMS) (8), HyperLogLog (9) and Bloom filters (10). A more in-depth discussion of many of these topics can also be found in (3, 4), including a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13), with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses.
     doi = 10.1093/nar/gkaa265
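
To make the k-mer membership queries mentioned in the summary above concrete, here is a minimal Bloom filter sketch. It is a generic textbook construction, not RAMBO or any tool from the review; the class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are assumptions chosen for simplicity.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several hash functions.

    Membership tests can return false positives but never false negatives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumed hashing scheme: one SHA-256 digest per seed value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


dataset_filter = BloomFilter()
for kmer in kmers("ACGTACGTGA", 4):
    dataset_filter.add(kmer)
print("ACGT" in dataset_filter)
```

One filter per dataset gives a compact index: a query k-mer that is absent from a filter is guaranteed absent from that dataset, while a hit only indicates probable presence.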

      id = cord-256278-jvfjf7aw
  author = Feng, Jie
   title = New method for comparing DNA primary sequences based on a discrimination measure
    date = 2010-10-21
keywords = dna; sequence
 summary = Three years later, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2, we present the similarity/dissimilarity matrix for the full DNA sequences of the β-globin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 β-globin gene sequences based on the distance matrix DM, using the NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences.
     doi = 10.1016/j.jtbi.2010.07.040
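
The word-frequency distances that the summary above attributes to earlier alignment-free work can be sketched as follows. This is not the paper's discrimination measure, which the summary does not define; it is a plain Euclidean distance between k-word frequency profiles, and the function names and the default `k=2` are assumptions.

```python
from collections import Counter
from math import sqrt


def word_frequencies(sequence, k):
    """Relative frequency of each overlapping word of length k."""
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def word_distance(seq_a, seq_b, k=2):
    """Euclidean distance between the two k-word frequency profiles."""
    freq_a = word_frequencies(seq_a, k)
    freq_b = word_frequencies(seq_b, k)
    words = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2
                    for w in words))


print(word_distance("ACGTACGT", "ACGTTGCA"))
```

Evaluating `word_distance` for every pair of sequences produces a distance matrix of the kind that a neighbor-joining tree can be built from.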

      id = cord-016594-lj0us1dq
  author = Flower, Darren R.
   title = Identification of Candidate Vaccine Antigens In Silico
    date = 2012-09-28
keywords = MHC; antigen; prediction; protein; sequence; vaccine
 summary = In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequenced, and progresses through an extensive computational stage, ultimately to deliver a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins and analysis in a series of in-vitro and in-vivo assays and animal models, with the ultimate objective of isolating one or two proteins displaying protective immunity.
     doi = 10.1007/978-1-4614-5070-2_3

      id = cord-001974-wjf3c7a7
  author = Friis-Nielsen, Jens
   title = Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
    date = 2016-02-19
keywords = Sequencing; Table; cluster; sequence; virus
 summary = Recurrent sequences were statistically associated with biological, methodological or technical features with the aim of identifying novel pathogens or plausible contaminants that may be associated with a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP.
     doi = 10.3390/v8020053

      id = cord-016798-tv2ntug6
  author = Gautam, Ablesh
   title = Bioinformatics Applications in Advancing Animal Virus Research
    date = 2019-06-06
keywords = genome; sequence; tool; viral; virus
 summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition, and tools that make it possible to analyse host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that are different from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
     doi = 10.1007/978-981-13-9073-9_23

      id = cord-302798-q0mbngqy
  author = Ge, Junwei
   title = Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China
    date = 2018-06-14
keywords = MiCV; TAC; sequence
 summary = In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27].
     doi = 10.1007/s00705-018-3908-5

      id = cord-017932-vmtjc8ct
  author = Georgiev, Vassil St.
   title = Genomic and Postgenomic Research
    date = 2009
keywords = NIAID; gene; genome; sequence
 summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
     doi = 10.1007/978-1-60327-297-1_25

      id = cord-325043-vqjhiv7p
  author = Gorbalenya, Alexander E.
   title = An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication
    date = 1989
keywords = NTP; RNA; protein; sequence
 summary = 
     doi = 10.1007/bf02102483

      id = cord-328259-3g4klpyg
  author = Guajardo-Leiva, Sergio
   title = Metagenomic Insights into the Sewage RNA Virosphere of a Large City
    date = 2020-09-21
keywords = NCBI; RNA; Rotavirus; Trebal; sequence; viral
 summary = Despite the overrepresentation of dsRNA viruses, our results show that Santiago's sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B) and the Totiviridae family were also highly abundant in treated and untreated sewage samples from the EU [5, 7]. Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments.
     doi = 10.3390/v12091050

      id = cord-354465-5nqrrnqr
  author = Haslinger, Christian
   title = RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties
    date = 1999
keywords = RNA; graph; secondary; sequence; structure
 summary = Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992).
     doi = 10.1006/bulm.1998.0085

      id = cord-348427-worgd0xu
  author = Hatcher, Eneida L.
   title = Virus Variation Resource – improved response to emergent viral outbreaks
    date = 2017-01-04
keywords = Resource; Variation; Virus; sequence
 summary = The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus (Table 2). New processes have been added to parse source descriptor terms from GenBank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. To resolve this issue, the Virus Variation database loading pipeline parses GenBank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach.
     doi = 10.1093/nar/gkw1065

      id = cord-263987-ff6kor0c
  author = Holmes, Ian H.
   title = Solving the master equation for Indels
    date = 2017-05-12
keywords = Markov; RNA; model; sequence
 summary = BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
     doi = 10.1186/s12859-017-1665-1

      id = cord-330067-ujhgb3b0
  author = Huang, Yi
   title = CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes
    date = 2007-10-02
keywords = SARS; sequence
 summary = To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as 'Corona_NS3b' (accession number PF03053). CoVDB is a database of annotated coronavirus genes and genomes that offers efficient batch sequence retrieval and analysis.
     doi = 10.1093/nar/gkm754

      id = cord-325985-xfzhn1n1
  author = Jabado, Omar J.
   title = Comprehensive viral oligonucleotide probe design using conserved protein regions
    date = 2007-12-13
keywords = Pfam; probe; sequence
 summary = The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and, finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31)] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32). Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions.
     doi = 10.1093/nar/gkm1106

      id = cord-017354-cndb031c
  author = Janies, D.
   title = Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases
    date = 2008
keywords = H5N1; SARS; influenza; phylogenetic; sequence; tree
 summary = The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, and geographic and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor and descendant. Phylogenetic analysis of large genomic datasets can present several nested NP-complete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92]. Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts.
     doi = 10.1007/978-3-540-74331-6_2

      id = cord-017584-9rx4jlw8
  author = Kim, Kwangsoo
   title = Selecting Genotyping Oligo Probes Via Logical Analysis of Data
    date = 2007
keywords = probe; sequence
 summary = Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3, using viral genomic sequences from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites.
     doi = 10.1007/978-3-540-72665-4_8
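
The idea of short oligo probes that separate one group of sequences from another can be illustrated with a simple set-based sketch. This is not the logical-analysis-of-data method of the paper, whose details are not given in the summary; it merely lists k-mers found in every sequence of a target group and in none of the others. The function names and the toy sequences are hypothetical.

```python
def kmer_set(sequence, k):
    """Set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def discriminating_probes(targets, others, k):
    """k-mers present in every target sequence and absent from all others."""
    if not targets:
        return []
    shared = set.intersection(*(kmer_set(seq, k) for seq in targets))
    seen_elsewhere = set().union(*(kmer_set(seq, k) for seq in others))
    return sorted(shared - seen_elsewhere)


# Toy sequences, invented for illustration only.
target_group = ["ACGTAC", "TACGTT"]
other_group = ["GGGTTT", "TTTACC"]
print(discriminating_probes(target_group, other_group, 3))
```

A real probe design would also have to handle mismatches, hybridization behaviour, and generalization to unseen sequences, which this sketch ignores.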

      id = cord-324021-y1vr1db0
  author = Kozak, M.
   title = Determinants of translational fidelity and efficiency in vertebrate mRNAs
    date = 1994-12-31
keywords = AUG; codon; sequence; translation
 summary = 
     doi = 10.1016/0300-9084(94)90182-1

      id = cord-353290-1wi1dhv6
  author = Kustin, Talia
   title = Biased mutation and selection in RNA viruses
    date = 2020-09-28
keywords = Fig; RNA; sequence; virus
 summary = We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses.
     doi = 10.1093/molbev/msaa247

      id = cord-001340-kqcx7lrq
  author = Ladner, Jason T.
   title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
    date = 2014-06-17
keywords = genome; sequence; viral
 summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization.
     doi = 10.1128/mbio.01360-14

      id = cord-321150-ev6acl7b
  author = Lam, Ha Minh
   title = Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
    date = 2017-10-03
keywords = sequence; site
 summary = Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values.
     doi = 10.1093/molbev/msx263

      id = cord-025610-7vouj8pp
  author = Latif, Seemab
   title = Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
    date = 2020-05-06
keywords = sequence; word
 summary = In this paper, we propose a novel neural probabilistic architecture based on a backward-forward language model and a word embedding substitution method that can cater to multiple lexical constraints for generating quality sequences. Recently, language models based on Recurrent Neural Networks (RNNs) and their variants, such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs), have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13]. Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with a word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method, to address the issue of lexically constrained sequence generation.
     doi = 10.1007/978-3-030-49186-4_4

      id = cord-331698-rwow1ydx
  author = Latorre-Pérez, Adriel
   title = A lab in the field: applications of real-time, in situ metagenomic sequencing
    date = 2020-08-20
keywords = 16S; ONT; dna; metagenomic; sequence; sequencing
 summary = This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25], microbial communities inspection in the Arctic [26], DNA sequencing on the International Space Station (ISS) [27], and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29]. In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced.
     doi = 10.1093/biomethods/bpaa016

      id = cord-252347-vnn4135b
  author = Lee, Wai-Ming
   title = A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants
    date = 2007-10-03
keywords = HRV; P1-P2; PCR; sequence
 summary = METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amounts of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5'NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences.
     doi = 10.1371/journal.pone.0000966

      id = cord-338207-60vrlrim
  author = Lefkowitz, E.J.
   title = Virus Databases
    date = 2008-07-30
keywords = NCBI; database; datum; information; sequence
 summary = (Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes.
     doi = 10.1016/b978-012374410-4.00719-6

      id = cord-342785-55r01n0x
  author = Lemmon, Gordon H
   title = Predicting the sensitivity and specificity of published real-time PCR assays
    date = 2008-09-25
keywords = PCR; sequence; signature; time
 summary = METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens
     doi = 10.1186/1476-0711-7-18

      id = cord-321386-u1imic5l
  author = Li, Chun
   title = Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
    date = 2018-02-17
keywords = Prot; dna; protein; sequence
 summary = METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a dimensional feature vector can be constructed to numerically characterize a protein sequence. By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition
     doi = 10.2174/1386207321666180130100838

      id = cord-306725-0vam15pt
  author = Li, Hao
   title = First detection and genomic characteristics of bovine torovirus in dairy calves in China
    date = 2020-05-09
keywords = China; sequence
 summary = Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2), designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV/SC-2/China investigated in this study are indicated by black triangles. Fig. 2: Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins.
     doi = 10.1007/s00705-020-04657-9

      id = cord-341879-vubszdp2
  author = Li, Lucy M
   title = Genomic analysis of emerging pathogens: methods, application and future trends
    date = 2014-11-22
keywords = disease; population; sequence
 summary = In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences.
     doi = 10.1186/s13059-014-0541-9

      id = cord-345552-h6fwi0qn
  author = Li, Q.-G.
   title = Hydropathic characteristics of adenovirus hexons
    date = 1997-07-01
keywords = dna; hexon; sequence
 summary = The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2). DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein
     doi = 10.1007/s007050050162

      id = cord-001537-i34vmfpp
  author = Lima, Francisco Esmaile de Sales
   title = Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil
    date = 2015-02-17
keywords = Circoviridae; Cyclovirus; dna; sequence
 summary = The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2).
     doi = 10.1371/journal.pone.0118070

      id = cord-330312-1pjolkql
  author = Liu, Y.-T.
   title = Infectious Disease Genomics
    date = 2017-01-20
keywords = HGP; genome; human; malaria; sequence
 summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30-32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum
     doi = 10.1016/b978-0-12-799942-5.00010-x

      id = cord-265857-fs6dj3dp
  author = Liu, Yu-Tsueng
   title = Infectious Disease Genomics
    date = 2010-12-24
keywords = genome; human; sequence
 summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002). Genome sequencing projects for other important human disease vectors are in progress (Megy et al., 2009). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
     doi = 10.1016/b978-0-12-384890-1.00010-8

      id = cord-287658-c2lljdi7
  author = Lopez-Rincon, Alejandro
   title = Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning
    date = 2020-09-10
keywords = CoV-2; RNA; SARS; sequence
 summary = The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g., Real-Time PCR amplicons to identify the SARS-CoV-2 16. The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in 26 the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID (Table 2) repository.
     doi = 10.1101/2020.03.13.990242

      id = cord-302161-ytr7ds8i
  author = Lutz, Mirjam
   title = FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP
    date = 2020-07-24
keywords = FIP; ORF; sequence; zu1
 summary = 
     doi = 10.3390/pathogens9080603

      id = cord-025948-6dsx7pey
  author = Maitra, Arindam
   title = Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
    date = 2020-06-04
keywords = India; RNA; SARS; mutation; sequence
 summary = Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRP and nucleocapsid coding genes.
     doi = 10.1007/s12038-020-00046-1

      id = cord-010161-bcuec2fz
  author = Matson, David O.
   title = IV, 6. Calicivirus RNA recombination
    date = 2004-09-14
keywords = RNA; sequence
 summary = With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2). While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5' end of that strain's capsid gene (ID="B" sequence for this Fig.). The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades.
     doi = 10.1016/s0168-7069(03)09032-3

      id = cord-275258-azpg5yrh
  author = Mead, Dylan J.T.
   title = Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling
    date = 2019-07-26
keywords = CNN; model; sequence; table
 summary = This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level.
     doi = 10.1016/j.jmgm.2019.07.014

      id = cord-027316-echxuw74
  author = Modarresi, Kourosh
   title = Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model
    date = 2020-05-22
keywords = model; sequence
 summary = This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94]. Though modified versions of RNNs (recurrent neural networks) such as LSTM and GRU have been an improvement over plain RNNs in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is α (Embedding Error)^2 (Eq. 6). Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation
     doi = 10.1007/978-3-030-50420-5_20

      id = cord-325750-x7jpsnxg
  author = Mokili, John L
   title = Metagenomics and future perspectives in virus discovery
    date = 2012-01-20
keywords = Koch; dna; figure; metagenomic; sequence; viral; virus
 summary = 
     doi = 10.1016/j.coviro.2011.12.004

      id = cord-000642-mkwpuav6
  author = Moreira, Rebeca
   title = Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing
    date = 2012-04-19
keywords = Ruditapes; immune; philippinarum; protein; sequence
 summary = The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes such as apoptosis, the Toll-like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18]. philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae).
     doi = 10.1371/journal.pone.0035009

      id = cord-311240-o0zyt2vb
  author = Motayo, Babatunde Olarenwaju
   title = Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences
    date = 2020-07-27
keywords = Africa; SARS; sequence
 summary = Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating; we advocate for upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control effort of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020). This study was designed to determine the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against reference whole genome sequences of SARS. Recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2).
     doi = 10.1101/2020.07.27.222901

      id = cord-018459-isbc1r2o
  author = Munjal, Geetika
   title = Phylogenetics Algorithms and Applications
    date = 2018-12-10
keywords = phylogenetic; sequence
 summary = This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15]. Constructing phylogenetic trees using multiple sequence alignment
     doi = 10.1007/978-981-13-5934-7_17

      id = cord-264746-gfn312aa
  author = Muse, Spencer
   title = GENOMICS AND BIOINFORMATICS
    date = 2012-03-29
keywords = RNA; dna; figure; gene; genome; sequence
 summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research.
     doi = 10.1016/b978-0-12-238662-6.50015-x

      id = cord-321762-7kiahjyy
  author = Nandy, Ashesh
   title = Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences
    date = 2015-12-31
keywords = dna; graphical; protein; representation; sequence
 summary = 
     doi = 10.1016/b978-1-68108-053-6.50005-3

      id = cord-326225-crtpzad7
  author = Neill, John D.
   title = Simultaneous rapid sequencing of multiple RNA virus genomes
    date = 2014-06-01
keywords = RNA; sequence; virus
 summary = This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation of each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2, library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled.
     doi = 10.1016/j.jviromet.2014.02.016

      id = cord-014461-2ubh9u8r
  author = Nelson, Oranmiyan W.
   title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
    date = 2012-10-10
keywords = Complete; Draft; Genome; Strain; isolate; sequence
 summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle; Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog; Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk; Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp.; Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America; Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011; Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China; Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean; Complete Genome Sequence of a Polyomavirus Isolated from Horses; Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea; Draft Genome Sequence of Aspergillus oryzae Strain 3.042
     doi = 10.4056/sigs.3416907

      id = cord-016293-pyb00pt5
  author = Newell-McGloughlin, Martina
   title = The flowering of the age of Biotechnology 1990–2000
    date = 2006
keywords = FDA; Genome; NIH; RNA; U.S.; University; Venter; cell; disease; dna; gene; human; plant; sequence; technology
 summary = 
     doi = 10.1007/1-4020-5149-2_4

      id = cord-255371-o9oxchq6
  author = Nguyen, Thanh Thi
   title = Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
    date = 2020-07-10
keywords = SARS; mutation; protein; sequence
 summary = This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations that occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13]. By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics.
     doi = 10.1101/2020.07.10.171769

      id = cord-012975-u87ol3fs
  author = Ogiwara, Atsushi
   title = Construction of a dictionary of sequence motifs that characterize groups of related proteins
    date = 1992-09-17
keywords = motif; sequence
 summary = An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites.
     doi = 10.1093/protein/5.6.479

      id = cord-355075-ieb35upi
  author = Papenfuss, Anthony T
   title = The immune gene repertoire of an important viral reservoir, the Australian black flying fox
    date = 2012-06-20
keywords = MHC; RNA; bat; gene; sequence
 summary = P. alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full-length transcript, encoding a 667 amino acid protein, was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors, were highly represented in both the thymus and pooled datasets, providing evidence that bats have all of the components necessary to mount an adaptive immune response.
     doi = 10.1186/1471-2164-13-261

      id = cord-304607-td0776wj
  author = Paszkiewicz, Konrad H.
   title = Omics, Bioinformatics, and Infectious Disease Research
    date = 2010-12-24
keywords = gene; genome; protein; sequence
 summary = 
     doi = 10.1016/b978-0-12-384890-1.00018-2

      id = cord-264135-s2u76pvk
  author = Patel, Amrutlal K.
   title = Complete genome sequence analysis of chicken astrovirus isolate from India
    date = 2016-12-23
keywords = indian; sequence
 summary = Phylogenetic analysis of the astrovirus genomes suggested formation of a separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2). A total of 9-10 epitopes were predicted with SVMTriP from the capsid protein sequences of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of CAstV/INDIA/ANAND/2016 nearest to CAstV/4175 and CAstV/GA2011, and all four chicken astroviruses formed a separate cluster, except for the capsid protein of the CAstV/Poland/G059/2014 isolate, which clustered with the duck astroviruses. The analysis of capsid protein sequences of reported chicken astroviruses from India revealed limited structural divergence, suggesting their common ancestral origin and recent emergence. Fig. 4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with
     doi = 10.1007/s11259-016-9673-6

      id = cord-341564-fvuwick5
  author = Qi, Zhao-Hui
   title = Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application
    date = 2018-06-12
keywords = protein; sequence
 summary = From these, we can see that physicochemical properties are widely applied in graphical representations of protein sequences by these researchers, and their results appear good. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17-21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix.
     doi = 10.1177/1176934318777755

      id = cord-321715-bkfkmtld
  author = Redelings, Benjamin D
   title = Incorporating indel information into phylogeny estimation for rapidly emerging pathogens
    date = 2007-03-14
keywords = alignment; distribution; indel; model; sequence
 summary = To see if indel information improves phylogenetic resolution, we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t_1, ..., t_{2N-3}) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ of parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account, we assume that these differences represent an improvement in the accuracy of estimation.
     doi = 10.1186/1471-2148-7-40

      id = cord-267500-x3u9i1vq
  author = Rose, Rebecca
   title = Challenges in the analysis of viral metagenomes
    date = 2016-08-03
keywords = Assembly; Bruijn; read; sequence
 summary = Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes, demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al.
     doi = 10.1093/ve/vew022

      id = cord-300149-djclli8n
  author = Ruan, Yijun
   title = Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
    date = 2003-05-24
keywords = SARS; sequence
 summary = METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative amino acid change in the S1 region of the spike protein suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677.
     doi = 10.1016/s0140-6736(03)13414-9

      id = cord-015850-ef6svn8f
  author = Saitou, Naruya
   title = Eukaryote Genomes
    date = 2013-08-22
keywords = RNA; dna; gene; genome; sequence
 summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81].
     doi = 10.1007/978-1-4471-5304-7_8

      id = cord-264296-0x90yubt
  author = Sawmya, Shashata
   title = Analyzing hCov genome sequences: Applying Machine Intelligence and beyond
    date = 2020-06-03
keywords = China; Coronavirus; India; sequence
 summary = We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries, uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the SARS-CoV-2 genome sequence, followed by further analyses of the same on several South-Asian countries. D. Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries; and (c) to analyse the mutation at specific important sites of the viral genome.
     doi = 10.1101/2020.06.03.131987

      id = cord-268467-btfz6ye8
  author = Schreiber, Steven S.
   title = Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E
    date = 1989-03-31
keywords = HCV-229E; RNA; sequence
 summary = The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV-229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3′-end of the genomic RNA or the leader sequence. The 3′-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3′-end (Fig. 4), which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et al., 1983; Lapps et al., 1987; Kamahora et al., 1988; Boursnell et al., 1985) (Table 1). Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3′-end of the viral mRNA leader sequence
     doi = 10.1016/0042-6822(89)90050-0

      id = cord-010273-0c56x9f5
  author = Simmonds, Peter
   title = Virology of hepatitis C virus
    date = 2001-10-10
keywords = HCV; RNA; hepatitis; sequence; virus
 summary = The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complementary to the HCV genome. Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5′NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) [111-114] and show a three-to-four-times higher rate of sequence change with time in persistently infected patients [115]. Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection.
     doi = 10.1016/s0149-2918(96)80193-7

      id = cord-213136-euv6pqh5
  author = Singh, Kulveer
   title = Sequence Effects on Internal Structure of Droplets of Associative Polymers
    date = 2020-05-17
keywords = polymer; sequence
 summary = We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R_ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers).
     doi = nan

      id = cord-022348-w7z97wir
  author = Sola, Monica
   title = Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
    date = 2007-09-02
keywords = HIV-1; RNA; figure; sequence; virus
 summary = Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6.1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6.1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity.
     doi = 10.1016/b978-012220360-2/50007-6

      id = cord-266960-kyx6xhvj
  author = Temple, Mark D.
   title = Real-time audio and visual display of the Coronavirus genome
    date = 2020-10-02
keywords = RNA; audio; display; sequence
 summary = The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provides the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provides a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata was combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORFs; these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example 'Sonification Sub-genomic RNA', the auditory display represents the process of transcription.
     doi = 10.1186/s12859-020-03760-7
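
The summary above refers to codons derived from all three reading frames of the viral RNA. A minimal sketch of that splitting step, assuming forward frames only; the function name `codons_by_frame` is hypothetical and this is not the tool's actual code:

```python
def codons_by_frame(seq):
    """Split a nucleotide sequence into codons for the three forward frames.

    Hypothetical helper for illustration. DNA input is converted to RNA
    letters, and trailing bases that do not fill a codon are dropped.
    """
    rna = seq.upper().replace("T", "U")
    frames = {}
    for offset in range(3):
        # Last index (exclusive) that still yields complete 3-base codons.
        usable = len(rna) - (len(rna) - offset) % 3
        frames[offset + 1] = [rna[i:i + 3] for i in range(offset, usable, 3)]
    return frames


frames = codons_by_frame("ATGGCCTAA")
print(frames[1])  # frame 1 starts at the first base
```

Each codon list could then be mapped to a note or sound; that mapping is left out here because the summary does not specify it.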

      id = cord-300807-9u8idlon
  author = Tong, Joo Chuan
   title = 7 Infectious disease informatics
    date = 2013-12-31
keywords = disease; sequence
 summary = 
     doi = 10.1533/9781908818416.99

      id = cord-254942-g51mjj2b
  author = Touati, Rabeb
   title = New methodology for repetitive sequences identification in human X and Y chromosomes
    date = 2020-10-19
keywords = dna; repetitive; sequence
 summary = 
     doi = 10.1016/j.bspc.2020.102207

      id = cord-301827-a7hnuxy5
  author = Uversky, Vladimir N
   title = A decade and a half of protein intrinsic disorder: Biology still waits for physics
    date = 2013-04-29
keywords = IDPs; bind; disorder; function; interaction; intrinsic; protein; region; sequence; structure
 summary = Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give a reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. This analysis revealed that proteins involved in regulation and execution of PCD possess a substantial amount of intrinsic disorder, and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions and interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns.
     doi = 10.1002/pro.2261

      id = cord-339209-oe8onyr9
  author = Vasilakis, Nikos
   title = Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range
    date = 2014-05-20
keywords = RNA; figure; mesonivirus; sequence; virus
 summary = The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5′-untranslated region (5′-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail (Figure 2). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CLpro, RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-terminal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain (Figure 4A, 4D).
     doi = 10.1186/1743-422x-11-97

      id = cord-296691-cg463fbn
  author = Wang, Ren
   title = De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing
    date = 2013-04-09
keywords = Amaryllidaceae; Lycoris; alkaloid; sequence
 summary = 
     doi = 10.1371/journal.pone.0060449

      id = cord-324216-ce3wa889
  author = Wang, Zheng
   title = Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses
    date = 2008-12-01
keywords = HEV; HRV; flu; sequence
 summary = Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17]. This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17], developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets, in conjunction with the previously developed sequence analysis algorithm CIBSI, can be easily interpreted to make serotype or strain identifications.
     doi = 10.1186/1471-2164-9-577

      id = cord-022494-d66rz6dc
  author = Webb, B.
   title = Comparative Modeling of Drug Target Proteins
    date = 2014-10-01
keywords = comparative; model; modeling; sequence; structure
 summary = Comparative modeling consists of four main steps [23] (Figure 2(a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures [35]; (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field [107]; (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures [108]; and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (Figure 2(b)).
     doi = 10.1016/b978-0-12-409547-2.11133-3

      id = cord-311839-61djk4bs
  author = Wei, Dan
   title = A novel hierarchical clustering algorithm for gene sequences
    date = 2012-07-23
keywords = BKM; clustering; dna; sequence
 summary = We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1.
     doi = 10.1186/1471-2105-13-174
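
The summary above uses the k-tuple distance as its baseline. A minimal sketch of one common formulation, the Euclidean distance between k-tuple (k-mer) count vectors, is given below. The function names are hypothetical, the paper's exact definition may differ, and the authors' DMk measure is not reproduced here.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (k-mer) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=3):
    """Euclidean distance between the k-tuple count vectors of two sequences.

    Assumed formulation for illustration only; a Counter returns 0 for
    k-tuples that are absent from one of the sequences.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))


print(ktuple_distance("ACGTACGT", "ACGTTCGT", k=2))
```

Because no alignment is needed, a distance of this kind can be computed for every pair of sequences and passed to any clustering routine.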

      id = cord-343863-q1y8uscj
  author = Whitney, Joe
   title = Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
    date = 2005-02-08
keywords = blast; sequence
 summary = ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components (Figure 1): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; and (4) a back-end Java program which runs alignment programs and compiles results in the database.
     doi = 10.1186/1471-2105-6-23
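
The core comparison described above (hits present only in the search against the updated database) amounts to a set difference. A minimal sketch, assuming each search result is available as a mapping from hit identifier to E-value; the function name and the identifiers are hypothetical, and this is not ReHAB's actual Java/MySQL implementation:

```python
def hits_only_in_update(old_hits, updated_hits):
    """Return identifiers found in the updated search but not in the old one.

    Both arguments map a hit identifier to its E-value (an assumed input
    shape). New hits are returned with the most significant one first.
    """
    fresh = set(updated_hits) - set(old_hits)
    return sorted(fresh, key=lambda hit_id: updated_hits[hit_id])


# Made-up identifiers and E-values, for illustration only.
old = {"P12345": 1e-50, "Q67890": 2e-10}
new = {"P12345": 1e-50, "Q67890": 3e-10, "A0A001": 5e-8, "A0A002": 1e-20}
print(hits_only_in_update(old, new))
```

Sorting by E-value puts the strongest new matches first, which is one reasonable order in which to inspect them.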

      id = cord-103029-nc5yf6x4
  author = Wichmann, Stefan
   title = Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank
    date = 2020-09-25
keywords = Fig; OLG; SGC; sequence
 summary = In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occurring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps, as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here.
     doi = 10.1101/2020.09.25.312959

      id = cord-103297-4stnx8dw
  author = Widrich, Michael
   title = Modern Hopfield Networks and Attention for Immune Repertoire Classification
    date = 2020-08-17
keywords = CMV; CNN; Hopfield; LSTM; MIL; sequence
 summary = In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016; Demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results").
     doi = 10.1101/2020.04.12.038158

      id = cord-253436-dz84icdc
  author = Wille, Michelle
   title = High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl
    date = 2016-03-03
keywords = Scaup; sequence
 summary = In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the same species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small sample sizes and differences in prevalence, what is clear is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks, are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30], and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31].
     doi = 10.1371/journal.pone.0150198

      id = cord-280881-5o38ihe0
  author = Wlodawer, Alexander
   title = A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases
    date = 2003-11-11
keywords = CLN2; enzyme; sedolisin; sequence
 summary = These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8]. We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences (Figure 1). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin (Figure 4), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2.
     doi = 10.1186/1472-6807-3-8

      id = cord-018963-2lia97db
  author = Xu, Ying
   title = Protein Structure Prediction by Protein Threading
    date = 2010-04-29
keywords = fold; protein; sequence; structure; threading
 summary = Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.
     doi = 10.1007/978-0-387-68825-1_1

      id = cord-010499-yefxrj30
  author = Yelverton, Elizabeth
   title = The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli
    date = 2006-10-27
keywords = Fig; Gallant; HIV; Weiss; sequence
 summary = Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gallant, 1983; 1986; Gallant et al., 1985; Kurland and Gallant, 1986). Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsley, 1992; Peter et al. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site.
     doi = 10.1111/j.1365-2958.1994.tb00310.x

      id = cord-005060-n901y2d4
  author = ZHANG, Feiyun
   title = Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus
    date = 2001
keywords = ORF; RNA; sequence
 summary = The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobemoviruses has been determined in Southern bean mosaic virus (SBMV)''2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the conserved sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region: proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3). In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8).
     doi = 10.1007/pl00012989

      id = cord-340907-j9i1wlak
  author = Zarai, Yoram
   title = Evolutionary selection against short nucleotide sequences in viruses and their related hosts
    date = 2020-04-27
keywords = ZIKV; sequence; supplementary; virus
 summary = Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depicts the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses.
     doi = 10.1093/dnares/dsaa008

      id = cord-266794-oyppubq5
  author = Zhang, Dachuan
   title = SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
    date = 2020-09-01
keywords = sequence
 summary = In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020), a genome browser (Ham, et al., 2012), and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species.
     doi = 10.1093/bioinformatics/btaa767

      id = cord-344782-ond1ziu5
  author = Zhang, Jing
   title = Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi)
    date = 2018-10-24
keywords = PCR; RNA; River; sequence; virus
 summary = Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists and samples collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue, two of the main affected organs.
     doi = 10.1371/journal.pone.0205209

      id = cord-193910-7p3f3znj
  author = Zhang, Xiangxie
   title = Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
    date = 2020-11-01
keywords = Levenshtein; dna; feature; sequence
 summary = In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a convolutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table (8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken.
     doi = nan

      id = cord-031957-df4luh5v
  author = dos Santos-Silva, Carlos André
   title = Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era
    date = 2020-09-02
keywords = amp; antimicrobial; figure; model; peptide; pin; plant; protein; sequence; structure
 summary = 
     doi = 10.1177/1177932220952739

      id = cord-001835-0s7ok4uw
  author = nan
   title = Abstracts of the 29th Annual Symposium of The Protein Society
    date = 2015-10-01
keywords = ATP; Biology; Ca21; Chemistry; Department; Institute; NADPH; NMR; PDB; RNA; Science; Tau; University; activity; base; bind; binding; cell; change; complex; design; dna; domain; enzyme; form; function; high; interaction; membrane; method; molecular; peptide; process; protein; residue; result; role; sequence; site; structure; study
 summary = Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural amino acids via the MIO-enzyme toolkit Alina Filip 1, Judith H Bartha-Vári 1, Gergely Bánóczy 2, László Poppe 2, Csaba Paizs 1, Florin-Dan Irimie 1 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology An attractive enzymatic route to the enantiomerically pure, highly valuable α- or β-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues.
     doi = 10.1002/pro.2823

      id = cord-004879-pgyzluwp
  author = nan
   title = Programmed cell death
    date = 1994
keywords = ATP; Basel; Bern; Drosophila; Institut; Lausanne; NMDA; PCR; PKC; RNA; Switzerland; TNF; University; acid; activity; cell; dna; expression; gene; high; human; increase; level; mouse; protein; receptor; result; sequence; study; type
 summary = Furthermore kinetic experiments after complementation of HIV-RT p66 with HIV-RT p51 indicated that HIV-RT p51 can restore rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer p66/p51, suggesting a function of the 51 kDa polypeptide. The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER). We could show that short-term activation (30 min) of c-FosER by estradiol (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins.
     doi = 10.1007/bf02033112

      id = cord-014462-11ggaqf1
  author = nan
   title = Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh
    date = 2011-04-21
keywords = BTV; CMV; CTV; ELISA; India; PCR; Pradesh; RNA; RTBV; disease; dna; gene; isolate; plant; protein; sequence; study; vaccine; virus
 summary = Molecular diagnosis based on reverse transcription (RT)-PCR, such as one-step or nested PCR, nucleic acid sequence based amplification (NASBA), or real time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that the NS1 antigen based ELISA test can be a useful tool to detect dengue virus infection in patients during the early acute phase of disease, since the appearance of IgM antibodies usually occurs after the fifth day of the infection. The studies showed a high level of expression in the case of the constructed vector as compared to infected virus for the specific protein.
     doi = 10.1007/s13337-011-0027-2

      id = cord-014674-ey29970v
  author = nan
   title = Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002
    date = 2003
keywords = Gentechnik; dna; sequence
 summary = We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. We find that aside from problematic details of the experimental design and some erratic presentations of the data the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of 'criollo' maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig. 2 as regions flanking the CMV p-35S sequence. We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites.
     doi = 10.1007/s00103-003-0614-5

      id = cord-023208-w99gc5nx
  author = nan
   title = Poster Presentation Abstracts
    date = 2006-09-01
keywords = Fmoc; Gly; HPLC; Lys; NH2; NMR; Pro; RGD; RNA; Tyr; acid; activity; amino; bind; cell; dna; high; interaction; method; peptide; protein; receptor; result; sequence; structure; study
 summary = In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitination is a well-known mechanism in protein degradation in eukaryotic cells, in which many obsolete proteins and proteins with corrupted three-dimensional structure become marked by covalent attachment of ubiquitin through a multi-step enzymatic pathway. Ubiquitin is a small, 8.5 kDa peptide of 76 amino acid residues that targets such substrates for proteolysis in the proteasome. Recent studies showed that an extracellular ubiquitination process also takes place in the epididymis of humans and other animals and marks proteins on the surface of defective sperm. It appears that structurally and functionally defective sperm become surface-ubiquitinated by epididymal epithelial cells. This head-to-tail cyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity. As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity.
     doi = 10.1002/psc.797

      id = cord-023209-un2ysc2v
  author = nan
   title = Poster Presentations
    date = 2008-10-07
keywords = Ala; Arg; Asp; Fmoc; Glu; Gly; HPLC; Leu; Lys; NMR; Phe; Thr; Trp; Tyr; University; VEGF; Val; acid; activity; amino; bind; cell; dna; high; peptide; pro; protein; receptor; residue; result; sequence; structure; study; synthesis
 summary = Site-specific PEGylation of human IgG1-Fab using a rationally designed trypsin variant. In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifications of proteins with artificial functionalities under native state conditions. Recently, our group reported a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes. In the present study, peptide epitopes of the VP1 protein, both the 135-161 amino acid residues (P1) (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and the 135-161 amino acid residues containing tryptophan (Trp) on the N terminus (Trp-135-161) (P2), were synthesized by using microwave-assisted solid-phase methods. Using as a template a peptide, already identified, with agonist activity against PTPRJ (H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifications (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfide bridge) and the preliminary biological results of this set of compounds.
     doi = 10.1002/psc.1090

      id = cord-023647-dlqs8ay9
  author = nan
   title = Sequences and topology
    date = 2003-03-21
keywords = Evolution; Family; Gene; Human; Protein; acid; sequence
 summary = Nucleotide Sequence Analysis of the L Gene of Vesicular Stomatitis Virus (New Jersey Serotype) -- Identification of Conserved Domains in L Proteins of Nonsegmented Negative-Strand RNA Viruses DERSE I~ Equine Infectious Anemia Virus tat -- Insights into the Structure, Function, and Evolution of Lentivirus trans-Activator Proteins Ho~tu~ ~ s71 is a Phylogenetically Distinct Human Endogenous Retroviral Element with Structural and Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Ferredoxins from Rhodobacter capsulatus - Complete Amino Acid Sequences and Molecular Evolution Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Among the α-subunits and Molecular ~ The Predicted Amino Acid Sequence of α-Internexin is that of a Novel Neuronal Intermediate Filament Protein Intraspecific Evolution of a Gene Family Coding for Urinary Proteins Analysis of cDNA for Human ~ Ankyrin Indicates a Repeated Structure with Homology to Tissue-Differentiation and Cell-Cycle Control Protein
     doi = 10.1016/0959-440x(91)90051-t

      id = cord-300796-rmjv56ia
  author = nan
   title = The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation
    date = 1990-09-01
keywords = Fig; p62; protein; sequence
 summary = In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40-residue-long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980; references on dhfr sequence in legend to Fig. 1). Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain.
     doi = nan

      id = cord-256608-ajzk86rq
  author = van Weezep, Erik
   title = PCR diagnostics: In silico validation by an automated tool using freely available software programs
    date = 2019-05-13
keywords = PCR; sequence; silico
 summary = An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017; . The in silico specificity is expressed as the percentage of specific hits of taxonomy-classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001).
     doi = 10.1016/j.jviromet.2019.05.002