cord-000257-ampip7od	2010	With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig.
cord-000473-jpow6iw1	2011	High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population.
cord-000642-mkwpuav6	2012	title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18] . philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae).
cord-001340-kqcx7lrq	2014	Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/ DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization.
cord-001537-i34vmfpp	2015	The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C ). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2) .
cord-001786-ybd8hi8y	2014	These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database.
cord-001835-0s7ok4uw	2015	Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate than was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural amino acids via the MIO-enzyme toolkit. Alina Filip 1, Judith H Bartha-Vári 1, Gergely Bánóczy 2, László Poppe 2, Csaba Paizs 1, Florin-Dan Irimie 1; 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology. An attractive enzymatic route to the highly valuable enantiomerically pure α- or β-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues.
cord-001974-wjf3c7a7	2016	Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ËaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP.
cord-002473-2kpxhzbe	2017	Secondly, we build a graph-theoretic model using amino acid sequences, which is also applied to the cytochrome c7 family members, and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive ordered pairs, serially from the first amino acid to the end of the sequence, and each ordered pair is a connected edge between two nodes, where the nodes in the graph correspond to the different chemical groups of amino acids. Our method of phylogenetic tree formation uses the dissimilarity matrix, which is obtained for every pair of sequences on the basis of a chemical-group-specific score of amino acids. Based on the phylogenetic tree of five members, we find that PpcA and PpcD, and PpcB and PpcE, are the closest pairs with regard to the frequency of amino acids of the respective eight chemical groups.
cord-003316-r5te5xob	2018	WGS-based strain identification gives a far superior resolution. In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
cord-004862-yv76yvy5	1989	title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species.
cord-004879-pgyzluwp	1994	Furthermore, kinetic experiments after complementation of HIV-RT p66 with HIV-RT p51 indicated that HIV-RT p51 can restore the rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer p66/p51, suggesting a function of the 51 kDa polypeptide. The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER). We could show that short-term activation (30 min) of c-FosER by estradiol (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins.
cord-005060-n901y2d4	2001	The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobemoviruses has been determined in Southern bean mosaic virus (SBMV)''2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the conserved sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region: proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3). In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8).
cord-010161-bcuec2fz	2004	With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2). While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5' end of that strain's capsid gene (ID="B" sequence for this Fig.). The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades.
cord-010260-8lnpujip	1994	
cord-010273-0c56x9f5	2001	1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complementary to the HCV genome. 6 ''13 Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5'NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients. 115 Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection.
cord-010499-yefxrj30	2006	Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gallant, 1983; 1986; Gallant et al., 1985; Kurland and Gallant, 1986). Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsley, 1992; Peter et al. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site.
cord-011565-8ncgldaq	2020	For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7), Count-Min Sketch (CMS) (8), HyperLogLog (9) and Bloom filters (10). A more in depth discussion of many of these topics can also be found in (3, 4) includes a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13), with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses.
cord-012975-u87ol3fs	1992	An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites.
cord-014461-2ubh9u8r	2012	Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042
cord-014462-11ggaqf1	2011	Molecular diagnosis based on reverse transcription (RT)-PCR, such as one-step or nested PCR, nucleic acid sequence based amplification (NASBA), or real-time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that the NS1 antigen-based ELISA test can be a useful tool to detect dengue virus infection in patients during the early acute phase of disease, since the appearance of IgM antibodies usually occurs after the fifth day of infection. The studies showed a high level of expression in the case of the constructed vector as compared to infected virus for the specific protein.
cord-014674-ey29970v	2003	title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002: Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. We find that aside from problematic details of the experimental design and some erratic presentations of the data the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of 'criollo' maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig. 2 as regions flanking the CMV p-35S sequence. We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites.
cord-015850-ef6svn8f	2013	General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81].
cord-016293-pyb00pt5	2006	
cord-016594-lj0us1dq	2012	In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequenced, progresses through an extensive computational stage, and ultimately delivers a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins and analysis in a series of in-vitro and in-vivo assays and animal models, with the ultimate objective of isolating one or two proteins displaying protective immunity.
cord-016798-tv2ntug6	2019	The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition, and tools for analysing host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that are different from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
cord-017354-cndb031c	2008	The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, and geographic and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor and descendant. Phylogenetic analysis of large genomic datasets can present several nested NP-complete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92]. Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts.
cord-017584-9rx4jlw8	2007	Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites.
cord-017932-vmtjc8ct	2009	The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
cord-018133-2otxft31	2006	Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biotechnology Information (NCBI) gives integrated access to the biomedical literature, protein and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources.
cord-018459-isbc1r2o	2018	This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . Constructing phylogenetic trees using multiple sequence alignment
cord-018963-2lia97db	2010	Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.
cord-022348-w7z97wir	2007	Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6.1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6.1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity.
cord-022494-d66rz6dc	2014	Comparative modeling consists of four main steps 23 (Figure 2(a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (Figure 2(b)).
cord-023208-w99gc5nx	2006	In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitination is a well-known mechanism of protein degradation in eukaryotic cells, in which many obsolete proteins, or proteins with a corrupted three-dimensional structure, become marked by covalent attachment of ubiquitin through a multi-step enzymatic pathway. Ubiquitin is a small, 8.5 kDa peptide of 76 amino acid residues that targets such substrates for proteolysis in the proteasome. Recent studies showed that an extracellular ubiquitination process also takes place in the epididymis of humans and other animals and marks proteins on the surface of defective sperm. It appears that structurally and functionally defective sperm become surface-ubiquitinated by epididymal epithelial cells. This head-to-tail-cyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity. As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity.
cord-023209-un2ysc2v	2008	Site-specific PEGylation of human IgG1-Fab using a rationally designed trypsin variant In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifications of proteins with artificial functionalities under native state conditions. Recently, our group reported a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane-active carbochain polyelectrolytes. In the present study, peptide epitopes of the VP1 protein, both the 135-161 (P1) amino acid residues (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and the 135-161 amino acid residues containing tryptophan (Trp) on the N terminus (Trp-135-161) (P2), were synthesized by using the microwave-assisted solid-phase methods. Using as a template a peptide, already identified, with agonist activity against PTPRJ (H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifications (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfide bridge) and the preliminary biological results of this set of compounds.
cord-023647-dlqs8ay9	2003	Nucleotide Sequence Analysis of the L Gene of Vesicular Stomatitis Virus (New Jersey Serotype) -- Identification of Conserved Domains in L Proteins of Nonsegmented Negative-Strand RNA Viruses DERSE I~ Equine Infectious Anemia Virus tat -- Insights into the Structure, Function, and Evolution of Lentivirus trans-Activator Proteins Ho~tu~ ~ S71 is a Phylogenetically Distinct Human Endogenous Retroviral Element with Structural and Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Ferredoxins from Rhodobacter capsulatus -- Complete Amino Acid Sequences and Molecular Evolution Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Among the α-subunits and Molecular ~ The Predicted Amino Acid Sequence of α-Internexin is that of a Novel Neuronal Intermediate Filament Protein Intraspecific Evolution of a Gene Family Coding for Urinary Proteins Analysis of cDNA for Human ~ Ankyrin Indicates a Repeated Structure with Homology to Tissue-Differentiation and Cell-Cycle Control Protein
cord-025610-7vouj8pp	2020	In this paper, we propose a novel neural probabilistic architecture based on a backward-forward language model and a word embedding substitution method that can cater to multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13]. Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with a word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method, to address the issue of lexically constrained sequence generation.
cord-025948-6dsx7pey	2020	Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country to date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp, which is involved in replication of the viral genome, and the change of D614G in the Spike glycoprotein, which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRp and nucleocapsid coding genes.
cord-027316-echxuw74	2020	This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94]. Although modified versions of RNNs (recurrent neural networks) such as LSTM and GRU improve on plain RNNs in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is $\alpha \cdot (\text{Embedding Error})^2$ (6). Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation
cord-031957-df4luh5v	2020	
cord-033010-o5kiadfm	2020	RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively, occupying 45 and 47% of the total residues with the percentage loop occupancy at 8% regarded as comparative modeling, constructs atomic models based on known structures or structures that have been determined experimentally and likewise share more than 40% sequence homology.
cord-035033-osjy88rc	2020	Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimated participation index for event sequences. RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the point-based event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which takes a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials.
cord-102766-n6mpdhyu	2020	title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data.
cord-103029-nc5yf6x4	2020	In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occurring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here.
cord-103297-4stnx8dw	2020	In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016; Demircigil et al., 2017) have an update rule that is known as the attention mechanisms in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results"). Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule
cord-193910-7p3f3znj	2020	In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a convolutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table (8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken.
cord-203232-1nnqx1g9	2020	Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that they were not trained on, and had to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence.
cord-213136-euv6pqh5	2020	We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers).
cord-252347-vnn4135b	2007	METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amount of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5' NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences.
cord-253436-dz84icdc	2016	In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the same species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small sample sizes and differences in prevalence, what is clear is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks, are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30], and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31].
cord-254942-g51mjj2b	2020	
cord-255194-4i9fc0r7	2008	An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3' end (Figure 2). Additionally, in order to capture 5' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the Klenow reaction (Figure 2 shows a 5' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10^6 viral particles per RT-PCR reaction (Figure 5). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives.
cord-255371-o9oxchq6	2020	title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13] . By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics.
cord-256278-jvfjf7aw	2010	title: New method for comparing DNA primary sequences based on a discrimination measure Three years after, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2, we present the similarity/dissimilarity matrix for the full DNA sequences of β-globin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 β-globin gene sequences based on the distance matrix DM, using NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words
cord-256608-ajzk86rq	2019	An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017). The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001).
cord-263987-ff6kor0c	2017	BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
cord-264135-s2u76pvk	2016	Phylogenetic analysis of the astrovirus genomes suggested formation of a separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2). B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate A total of 9-10 epitopes were predicted using SVMTriP using the capsid protein sequence of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of CAstV/INDIA/ANAND/2016 nearest to CAstV/4175 and CAstV/GA2011, and all four chicken astroviruses formed a separate cluster, except the capsid protein of the CAstV/Poland/G059/2014 isolate, which clustered with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with
cord-264296-0x90yubt	2020	We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the SARS-CoV-2 genome sequence followed by further analyses of the same on several South-Asian countries. D. Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome.
cord-264746-gfn312aa	2012	The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research.
cord-265857-fs6dj3dp	2010	The completed or ongoing genome projects (Table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002). Genome sequencing projects for other important human disease vectors are in progress (Megy et al., 2009). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project.
cord-266288-buc4dd5y	2019	Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ^18. The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016); however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in ℝ^18, where the additional six dimensions describe the covariance between nucleotides.
cord-266794-oyppubq5	2020	title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang et al., 2020), a genome browser (Ham et al., 2012), and a catalytic function annotation tool (Dawson et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species.
cord-266960-kyx6xhvj	2020	The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORFs; these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example 'Sonification Sub-genomic RNA' the auditory display represents the process of transcription.
cord-267500-x3u9i1vq	2016	Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al.
cord-268467-btfz6ye8	1989	The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3''-end of the genomic RNA or the leader sequence. The 3''-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3''-end (Fig. 4) which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et al., 1983; Lapps et al., 1987; Kamahora et al., 1988; Boursnell et al., 1985) (Table 1). Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3''end of the viral mRNA leader sequence
cord-268549-2lg8i9r1	2012	It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation
cord-274056-9t3kneoo	2019	title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1 . The similarity/dissimilarity vectors that are corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2 . The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4 . The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike ) is illustrated in Table 14 and shown in Figure 6 .
cord-275258-azpg5yrh	2019	title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level.
cord-279528-41atidai	2019	Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A t , SA t ). The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in The 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b) respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A t and standard deviation SA t of the combined intensity level value A t (i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
cord-280881-5o38ihe0	2003	These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8] . We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences ( Figure 1 ). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin ( Figure 4 ), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2.
cord-287634-64zqe4cz	2020	For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents
cord-287658-c2lljdi7	2020	The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g., Real-Time PCR amplicons to identify the SARS-CoV-2 16 . The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in 26 the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets'' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID (Table 2) repository.
cord-291156-zxg3dsm3	2020	
cord-296691-cg463fbn	2013	
cord-300149-djclli8n	2003	title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative amino acid change in the S1 region of the spike protein suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677.
cord-300796-rmjv56ia	1990	In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40 residues long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980 ; references on dhfr sequence in legend to Fig. 1) , Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain.
cord-300807-9u8idlon	2013	
cord-301827-a7hnuxy5	2013	94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns.
cord-302161-ytr7ds8i	2020	
cord-302798-q0mbngqy	2018	In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5′ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27] .
cord-304607-td0776wj	2010	
cord-304869-l6a68tqn	2011	As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences, so it can be used for single nucleotide polymorphism (SNP) analyses, which are the subject of many investigations, as for example, in a recent work by Bhasi et al.
cord-306725-0vam15pt	2020	Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2) , designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV /SC-2/China investigated in this study are indicated by black triangles Fig. 2 Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins.
cord-310734-6v7oru2l	2020	
cord-311240-o0zyt2vb	2020	Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating. We advocate for upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control efforts of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020) . This study was designed to determine the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against reference whole genome sequences of SARS. Recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2 ).
cord-311839-61djk4bs	2012	We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1 .
cord-321150-ev6acl7b	2017	Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values.
cord-321386-u1imic5l	2018	METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a dimensional feature vector can be constructed to numerically characterize a protein sequence. By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou''s pseudo amino acid composition
cord-321715-bkfkmtld	2007	To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t_1, ..., t_{2N-3}) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation.
cord-321762-7kiahjyy	2015	
cord-324021-y1vr1db0	1994	
cord-324216-ce3wa889	2008	Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17] . This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17] , developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets in conjunction with previously developed sequence analysis algorithm CIBSI can be easily interpreted to make serotype or strain identifications.
cord-325043-vqjhiv7p	1989	
cord-325750-x7jpsnxg	2012	
cord-325985-xfzhn1n1	2007	The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions.
cord-326225-crtpzad7	2014	This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation of each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2 , library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled.
cord-328259-3g4klpyg	2020	Despite the overrepresentation of dsRNA viruses, our results show that Santiago''s sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B ) and Totiviridae family were also highly abundant in treated and untreated sewage samples from the EU [5, 7] . Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments.
cord-328644-odtue60a	2020	These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses, as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016), which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains with respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April 10, 2020 (Figure 1).
cord-330067-ujhgb3b0	2007	To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as ''Corona_NS3b'' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis.
cord-330312-1pjolkql	2017	One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30e32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10 .1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum
cord-331698-rwow1ydx	2020	This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25] , microbial communities inspection in the Arctic [26] , DNA sequencing on the International Space Station (ISS) [27] , and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29] . In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced.
cord-334127-wjf8t8vp	2015	This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11) (12) (13) , sequence assembly and virus discovery (14) (15) (16) (17) , viral dynamics and evolution (18) (19) (20) and pathogen detection (14, (21) (22) (23) . The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated ''viral host'' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51) .
cord-334394-qgyzk7th	2020	To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] .
cord-338207-60vrlrim	2008	(Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes.
cord-339209-oe8onyr9	2014	The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5''-untranslated region (5''-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail ( Figure 2 ). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CL pro , RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-terminal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain ( Figure 4A, 4D) .
cord-339915-8j04y50s	2014	Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of the protein sequence alphabet. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis of coronaviruses based on their spike proteins. In this paper, we introduce the DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis. Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation. Similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation.
cord-340907-j9i1wlak	2020	Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depict the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses.
cord-341564-fvuwick5	2018	From these, we can see that physicochemical properties are widely applied in graphical representations of protein sequences by these researchers and that their results appear sound. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17-21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids.
cord-341879-vubszdp2	2014	In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences.
cord-342785-55r01n0x	2008	METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real-time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high-sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. We predict that fifty-seven TaqMan PCR primer/probe combinations have higher sensitivity/specificity than currently published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens.
cord-343863-q1y8uscj	2005	ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components (Figure 1): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; and (4) a back-end Java program which runs alignment programs and compiles results in the database.
cord-344782-ond1ziu5	2018	Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists, and samples were collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes, and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue, two of the main affected organs.
cord-345552-h6fwi0qn	1997	The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2). DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein.
cord-348427-worgd0xu	2017	The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus (Table 2). New processes have been added to parse source descriptor terms from GenBank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. To resolve this issue, the Virus Variation database loading pipeline parses GenBank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach.
cord-353290-1wi1dhv6	2020	We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses.
cord-354465-5nqrrnqr	1999	Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In the case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992).
cord-355075-ieb35upi	2012	alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full-length transcript encoding a 667-amino-acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors, were highly represented in both the thymus and pooled datasets, providing evidence that bats have all of the components necessary to mount an adaptive immune response.