Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested and cached your content into a collection/corpus. It then applied sets of natural language processing and text mining techniques against the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

This report is a terse narrative report, and when processing is complete you will be linked to a more complete narrative report.

Eric Lease Morgan

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
118

Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
9974

Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
51

Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
119 sequence
26 RNA
24 dna
21 protein
13 virus
11 structure
11 SARS
10 genome
8 gene
8 Fig
7 model
6 PCR
5 study
5 figure
5 cell
5 acid
4 viral
4 result
4 peptide
4 human
4 high
4 disease
4 bind
4 activity
4 University
4 NMR
3 receptor
3 plant
3 method
3 interaction
3 Table
3 NCBI
3 India
3 CNN
3 CMV
2 vaccine
2 site
2 sequencing
2 residue
2 read
2 probe
2 phylogenetic
2 mutation
2 metagenomic
2 isolate
2 information
2 function
2 feature
2 enzyme
2 domain

Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
11814 sequence
8325 protein
3842 virus
3361 structure
3331 peptide
2943 cell
2929 gene
2782 genome
2659 method
2476 acid
2212 analysis
1942 %
1936 study
1908 model
1865 datum
1807 number
1726 dna
1707 result
1659 region
1503 activity
1464 amino
1379 alignment
1332 domain
1322 residue
1316 site
1312 group
1287 type
1278 interaction
1277 function
1257 approach
1255 database
1253 disease
1233 time
1179 receptor
1102 sample
1044 information
992 set
981 system
974 value
964 family
948 mutation
941 sequencing
924 length
917 similarity
905 dataset
904 motif
898 membrane
882 c
877 tree
858 specie

Top 50 proper nouns; "What are the names of persons or places?"
--------------------------------------------------------------
2113 al
1831 et
1615 .
1463 RNA
808 C
610 Fig
580 SARS
515 PCR
446 Table
393 A
363 N
362 k
360 DNA
359 T
350 S
337 Genome
333 University
328 NMR
324 II
280 DeepRC
273 M
269 ±
264 NCBI
258 Protein
251 CoV-2
246 HCV
242 B
225 L
219 fl
219 HIV
218 K
216 E.
213 j
209 D
207 de
197 GenBank
191 Virus
184 India
170 Human
169 bp
168 LSTM
164 Institute
164 F
164 CNN
160 RT
156 China
154 MS
151 Gly
150 G
150 C.

Top 50 personal pronouns; "To whom are things referred?"
-------------------------------------------------------------
6055 we
2912 it
1084 they
873 i
445 them
206 us
138 one
93 he
84 itself
44 themselves
19 you
12 him
11 she
9 me
6 ourselves
4 yÞ
3 l1oc
3 himself
3 her
2 ppifs
2 p450s
2 n40np
2 mine
2 ifnyr-/-mice
2 https://github.com/ababaian/serratus
2 em
2 cb562
1 ³hser
1 yegfp
1 y_~
1 y401
1 y
1 w@
1 u
1 tlg1
1 sod-3::gfp
1 s
1 pgem2dhfr
1 p110a
1 ours
1 n−3
1 nthash
1 myself
1 iv-3l3r.
1 insl3
1 icmv1
1 https://serratus.io
1 hc-201
1 hbs06
1 fbp17

Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
39197 be
7054 have
5776 use
2586 show
2223 base
1559 bind
1556 find
1419 contain
1399 identify
1296 include
1118 provide
1013 do
982 know
975 obtain
922 represent
884 determine
873 compare
842 suggest
840 give
840 generate
829 develop
781 indicate
755 increase
728 follow
704 perform
702 describe
700 see
699 predict
695 allow
694 involve
686 make
663 reveal
655 lead
648 associate
646 form
637 study
624 consider
617 observe
606 report
598 detect
597 result
596 produce
567 require
566 propose
565 relate
542 express
534 induce
530 characterize
528 isolate
526 cause

Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
2727 not
2245 also
2032 different
1976 high
1970 other
1855 -
1715 viral
1688 more
1678 such
1568 new
1509 human
1313 only
1306 well
1233 most
1165 large
1122 specific
1093 molecular
1091 first
1083 however
1070 structural
919 many
876 small
844 then
833 similar
833 low
831 important
814 single
811 several
794 biological
746 multiple
743 as
738 novel
733 non
727 same
727 immune
726 thus
660 long
656 highly
656 functional
652 possible
649 therefore
634 available
630 very
626 nucleotide
616 further
604 various
600 present
585 genetic
579 genomic
574 phylogenetic

Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------
394 most
232 good
189 least
146 high
99 Most
85 large
47 small
45 near
38 low
35 short
32 close
31 long
29 late
20 strong
17 great
16 early
15 simple
10 bad
6 fast
6 big
4 easy
4 -peptides
3 weak
3 old
2 ®
2 wide
2 thick
2 new
2 hot
2 clear
2 -which
2 -methylated
2 -hybrid
1 ~15
1 young
1 tight
1 slow
1 slim
1 slight
1 setcov
1 rugosa
1 quick
1 preS1
1 poor
1 poly(U
1 pdqu
1 molossid
1 mean:-42
1 loose
1 little

Top 50 lemmatized superlative adverbs; "How do things do to the extreme?"
------------------------------------------------------------------------
839 most
117 least
29 well
3 shortest
3 long
3 highest
3 clustalw
2 worst
1 ~3
1 smallest
1 near
1 fast

Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
23 github.com
22 www.ncbi.nlm.nih.gov
10 www
7 serratus.io
6 doi.org
4 www.niaid.nih.gov
4 www.ncbi
3 www.ncbi.nlm.nih
3 www.mdpi.com
3 www.gisaid.org
3 www.ebi.ac.uk
3 image.thelancet.com
2 www.who.int
2 www.sternadi.com
2 www.rcsb.org
2 www.broadinstitute.org
2 tree.bio.ed.ac.uk
2 pave.niaid.nih.gov
2 opensource.googleblog.com
2 mol.ax
2 evolution.genetics.washington.edu
2 earthmicrobiome.org
2 creativecommons.org
2 covdb.microbiology.hku.hk
2 compbio.dfci.harvard.edu
2 clients.adaptivebiotech.com
2 blast.ncbi.nlm.nih.gov
2 alla.cs.gsu.edu
1 xmtb
1 wwwmg
1 www3.niaid.nih.gov
1 www.wheatgenome.org
1 www.virology.wisc.edu
1 www.vetmed.ucdavis.edu
1 www.uniprot.org
1 www.spss.com
1 www.sanger.ac.uk
1 www.rostlab.org
1 www.ridom.com
1 www.predictprotein.org
1 www.picb.ac.cn
1 www.phred.org
1 www.pdb.org
1 www.oxfordjournals.org
1 www.ostp
1 www.microsoft.com
1 www.mg-rast.org
1 www.megasoft
1 www.kazusa.or.jp
1 www.jcvi.org

Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
10 http://www
4 http://www.ncbi
4 http://github.com/ml-jku/DeepRC
3 http://www.ncbi.nlm.nih.gov/
3 http://www.ncbi.nlm.nih
3 http://www.gisaid.org/
3 http://serratus.io/access
3 http://serratus.io
2 http://www.sternadi.com/phyvirus
2 http://www.niaid.nih.gov/dmid/genomes/
2 http://www.ncbi.nlm.nih.gov/genome/viruses/variation/
2 http://www.ncbi.nlm.nih.gov/genome/
2 http://pave.niaid.nih.gov/
2 http://github.com/spro/practical-pytorch
2 http://github.com/serratus-bio/tantalus
2 http://github.com/ababaian/serratus
2 http://github.com/
2 http://earthmicrobiome.org/protocols-and-standards/16s/
2 http://covdb.microbiology.hku.hk
2 http://clients.adaptivebiotech.com/pub/Emerson-2017-NatGen
2 http://blast.ncbi.nlm.nih.gov/Blast.cgi
2 http://alla.cs.gsu.edu/~software/VISPA/vispa.html
1 http://xmtb
1 http://wwwmg
1 http://www3.niaid.nih.gov/research/topics/
1 http://www.who.int/tdr
1 http://www.who.int/mediacentre/
1 http://www.wheatgenome.org/
1 http://www.virology.wisc.edu/acp/Aligns/seq_align.html
1 http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm
1 http://www.uniprot.org/
1 http://www.spss.com/
1 http://www.sanger.ac.uk/Projects/
1 http://www.rostlab.org/
1 http://www.ridom.com/seqsphere/
1 http://www.rcsb.org/structure/6M2N
1 http://www.rcsb.org/pdb/
1 http://www.predictprotein.org/
1 http://www.picb.ac.cn/
1 http://www.phred.org
1 http://www.pdb.org/
1 http://www.oxfordjournals.org/nar/database/
1 http://www.ostp
1 http://www.niaid.nih.gov/dmid/genomes/mscs/
1 http://www.niaid.nih.gov/dmid/
1 http://www.ncbi.nlm.nih.gov/sutils/pasc
1 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394299/
1 http://www.ncbi.nlm.nih.gov/nuccore/
1 http://www.ncbi.nlm.nih.gov/gorf/gorf.html
1 http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses

Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
2 journals.permissions@oup.com
1 ytliu@ucsd.edu
1 ncbi-help@ncbi.nlm.nih.gov
1 lichwun@163.com
1 krishna.bhattiprolu@uni-graz.at
1 ihh@berkeley.edu
1 baydin2@cs.gsu.edu
1 mara.kozic@liverpool.ac.uk

Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
-------------------------------------------------------------------------------
11 results are consistent
10 sequence is not
10 sequences are not
8 sequences are similar
7 data are available
7 method does not
7 sequences are also
7 sequences are often
7 sequences do not
7 sequences were then
6 methods do not
6 sequence was not
6 sequences were not
5 data are not
5 domains do not
5 methods are mostly
5 protein is not
5 proteins do not
5 sequence does not
5 sequence is present
5 sequence was also
5 sequences are then
5 sequences were available
4 method is deeprc
4 model is able
4 models using integrated
4 peptides were also
4 protein binding sites
4 proteins are very
4 sequences are available
4 sequences have significantly
4 sequences indicate higher
4 sequences were randomly
4 viruses are not
3 activity is also
3 activity was not
3 alignments were manually
3 data do not
3 gene does not
3 gene finding hmm
3 gene is also
3 gene is not
3 genes containing introns
3 genomes do not
3 group is present
3 groups has consistently
3 method is effective
3 method is highly
3 method is very
3 method performs well

Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
5 sequences have no mutation
2 studies are not homologous
1 acids are not common
1 acids are not yet
1 alignment is not trivial
1 alignments is not practical
1 analysis is not unique
1 analysis showed no segregation
1 cells are not only
1 cells are not well
1 data are not complete
1 data are not directly
1 data are not forthcoming
1 data are not readily
1 data are not suitable
1 data do not necessarily
1 data showed no evidence
1 data showed no significant
1 data were not correctly
1 dna was not certain
1 domain is not present
1 domain is not yet
1 domains do not appreciably
1 gene is not essential
1 gene is not lethal
1 genes are not identical
1 method is not generally
1 method is not practical
1 methods are not suffi
1 methods do not always
1 methods do not easily
1 models are not sufficiently
1 models have not yet
1 models were not able
1 number are not necessarily
1 number is not always
1 number is not linear
1 peptide is not effi
1 peptide showed no hemolytic
1 peptides are not ideal
1 peptides are not stable
1 peptides do no exhibit
1 peptides have no structure
1 peptides show no pressor
1 peptides was not easy
1 protein has no role
1 protein is not catalytically
1 protein is not entirely
1 protein is not solely
1 proteins are not strictly

A rudimentary bibliography
--------------------------
id = cord-274056-9t3kneoo author = Abd Elwahaab, Marwa A. title = A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector date = 2019-05-08 keywords = protein; sequence summary = title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1.
The similarity/dissimilarity vectors that correspond to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2. The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4. The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike) is illustrated in Table 14 and shown in Figure 6. doi = 10.1155/2019/8702968 id = cord-279528-41atidai author = Abo-Elkhier, Mervat M. title = Measuring Similarity among Protein Sequences Using a New Descriptor date = 2019-11-22 keywords = Table; sequence summary = Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A_t, SA_t). The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in The 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b) respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A_t and standard deviation SA_t of the combined intensity level value A_t(i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids doi = 10.1155/2019/2796971 id = cord-287634-64zqe4cz author = Al-Ssulami, Abdulrakeeb M. 
title = CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents date = 2020-01-31 keywords = sequence summary = For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents doi = 10.1016/j.ygeno.2019.02.002 id = cord-102766-n6mpdhyu author = Alam, Md. Nafis Ul title = Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date = 2020-06-25 keywords = RNA; feature; sequence summary = title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. 
We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. doi = 10.1101/2020.06.25.170779 id = cord-018133-2otxft31 author = Altman, Russ B. title = Bioinformatics date = 2006 keywords = datum; dna; information; sequence; structure summary = Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biotechnology Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources. doi = 10.1007/0-387-36278-9_22 id = cord-010260-8lnpujip author = Anthonsen, Henrik W. 
title = The blind watchmaker and rational protein engineering date = 1994-08-31 keywords = Fig; NMR; electrostatic; method; protein; sequence; structure summary = doi = 10.1016/0168-1656(94)90152-x id = cord-000473-jpow6iw1 author = Astrovskaya, Irina title = Inferring viral quasispecies spectra from 454 pyrosequencing reads date = 2011-07-28 keywords = HCV; read; sequence summary = High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population. doi = 10.1186/1471-2105-12-s6-s1 id = cord-035033-osjy88rc author = Aydin, Berkay title = Spatiotemporal event sequence discovery without thresholds date = 2020-11-09 keywords = ESMINER; RAND; event; sequence summary = Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. 
The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which will take a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials. doi = 10.1007/s10707-020-00427-6 id = cord-000257-ampip7od author = Bagowski, Christoph P title = The Nature of Protein Domain Evolution: Shaping the Interaction Network date = 2010-08-17 keywords = domain; evolution; protein; sequence summary = With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . 
This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig. doi = 10.2174/138920210791616725 id = cord-003316-r5te5xob author = Balloux, Francois title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic date = 2018-12-17 keywords = AMR; WGS; clinical; genome; sequence; sequencing summary = WGS-based strain identification gives a far superior resolution In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. 
doi = 10.1016/j.tim.2018.08.004 id = cord-291156-zxg3dsm3 author = Bernasconi, Anna title = Empowering Virus Sequences Research through Conceptual Modeling date = 2020-05-01 keywords = SARS; VCM; sequence; virus summary = doi = 10.1101/2020.04.29.067637 id = cord-304869-l6a68tqn author = Bielińska-Wąż, Dorota title = Graphical and numerical representations of DNA sequences: statistical aspects of similarity date = 2011-08-28 keywords = Fig; Table; dna; sequence summary = As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences, so it can be used for single nucleotide polymorphism (SNP) analyses, which are the subject of many investigations, as for example, in a recent work by Bhasi et al. doi = 10.1007/s10910-011-9890-8 id = cord-310734-6v7oru2l author = Bolatti, Elisa M. title = A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses date = 2020-04-09 keywords = Genomoviridae; Rep; bat; dna; sequence summary = doi = 10.3390/v12040422 id = cord-334127-wjf8t8vp author = Brister, J. 
Rodney title = NCBI Viral Genomes Resource date = 2015-01-28 keywords = NCBI; Viral; sequence summary = This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11) (12) (13) , sequence assembly and virus discovery (14) (15) (16) (17) , viral dynamics and evolution (18) (19) (20) and pathogen detection (14, (21) (22) (23) . The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated ''viral host'' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51) . doi = 10.1093/nar/gku1207 id = cord-203232-1nnqx1g9 author = Canturk, Semih title = Machine-Learning Driven Drug Repurposing for COVID-19 date = 2020-06-25 keywords = SARS; drug; sequence; virus summary = Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. 
For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence. doi = nan id = cord-328644-odtue60a author = Comandatore, Francesco title = Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes date = 2020-05-28 keywords = Coronavirus; SARS; sequence; variant summary = These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses -as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016) , which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April, 10, 2020 ( Figure 1 ). doi = 10.1101/2020.04.30.071027 id = cord-268549-2lg8i9r1 author = Dai, Qi title = Sequence comparison via polar coordinates representation and curve tree date = 2012-01-07 keywords = Randic; dna; sequence summary = It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. 
First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation doi = 10.1016/j.jtbi.2011.09.030 id = cord-002473-2kpxhzbe author = Das, Jayanta Kumar title = Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach date = 2017-03-31 keywords = acid; sequence summary = Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups. 
doi = 10.1371/journal.pone.0175031 id = cord-004862-yv76yvy5 author = Demers, G. William title = The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date = 1989 keywords = Fig; ORF-1; sequence summary = title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species. doi = 10.1007/bf02106177 id = cord-339915-8j04y50s author = Deng, Wei title = DV-Curve Representation of Protein Sequences and Its Application date = 2014-05-08 keywords = dna; sequence summary = Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is the phylogenetic analysis among coronaviruses based on their spike proteins. 
In this paper, we introduce the DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. doi = 10.1155/2014/203871 id = cord-255194-4i9fc0r7 author = Djikeng, Appolinaire title = Viral genome sequencing by random priming methods date = 2008-01-07 keywords = SISPA; coverage; sequence summary = An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3' end (Figure 2). Additionally, in order to capture 5' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the Klenow reaction (Figure 2 shows a 5' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10^6 viral particles per RT-PCR reaction (Figure 5). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives.
doi = 10.1186/1471-2164-9-5 id = cord-266288-buc4dd5y author = Dong, Rui title = A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance date = 2019-04-09 keywords = ANV; Accumulated; sequence summary = Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ^18. The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016); however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in ℝ^18, where the additional six dimensions describe the covariance between nucleotides. doi = 10.3389/fgene.2019.00234 id = cord-033010-o5kiadfm author = Durojaye, Olanrewaju Ayodeji title = Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date = 2020-10-02 keywords = Fig; SARS; model; protein; sequence; structure summary = RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV.
The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively occupying 45% and 47% of the total residues, with the percentage loop occupancy at 8%. Homology modeling, also regarded as comparative modeling, constructs atomic models based on known structures, or structures that have been determined experimentally, that likewise share more than 40% sequence homology. doi = 10.1186/s43042-020-00081-5 id = cord-001786-ybd8hi8y author = Dutilh, Bas E title = Metagenomic ventures into outer sequence space date = 2014-12-15 keywords = sequence; unknown summary = These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. doi = 10.4161/21597081.2014.979664 id = cord-334394-qgyzk7th author = Edgar, Robert C.
title = Petabase-scale sequence alignment catalyses viral discovery date = 2020-08-10 keywords = Extended; Figure; RNA; SRA; Serratus; genome; sequence summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] . doi = 10.1101/2020.08.07.241729 id = cord-011565-8ncgldaq author = Elworth, R A Leo title = To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date = 2020-06-04 keywords = Bloom; CMS; hash; sequence; set summary = For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7) , Count-Min Sketch (CMS) (8) , HyperLogLog (9) and Bloom filters (10) . 
A more in depth discussion of many of these topics can also be found in (3, 4) includes a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13), with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. doi = 10.1093/nar/gkaa265 id = cord-256278-jvfjf7aw author = Feng, Jie title = New method for comparing DNA primary sequences based on a discrimination measure date = 2010-10-21 keywords = dna; sequence summary = Three years after, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2, we present the similarity/dissimilarity matrix for the full DNA sequences of β-globin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 β-globin gene sequences based on the distance matrix DM, using NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences.
doi = 10.1016/j.jtbi.2010.07.040 id = cord-016594-lj0us1dq author = Flower, Darren R. title = Identification of Candidate Vaccine Antigens In Silico date = 2012-09-28 keywords = MHC; antigen; prediction; protein; sequence; vaccine summary = In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequenced, progresses through an extensive computational stage, and ultimately delivers a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins and analysis in a series of in-vitro and in-vivo assays and animal models, with the ultimate objective of isolating one or two proteins displaying protective immunity.
doi = 10.1007/978-1-4614-5070-2_3 id = cord-001974-wjf3c7a7 author = Friis-Nielsen, Jens title = Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date = 2016-02-19 keywords = Sequencing; Table; cluster; sequence; virus summary = Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP. doi = 10.3390/v8020053 id = cord-016798-tv2ntug6 author = Gautam, Ablesh title = Bioinformatics Applications in Advancing Animal Virus Research date = 2019-06-06 keywords = genome; sequence; tool; viral; virus summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. 
This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that differ from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al. doi = 10.1007/978-981-13-9073-9_23 id = cord-302798-q0mbngqy author = Ge, Junwei title = Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China date = 2018-06-14 keywords = MiCV; TAC; sequence summary = In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%).
In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27]. doi = 10.1007/s00705-018-3908-5 id = cord-017932-vmtjc8ct author = Georgiev, Vassil St. title = Genomic and Postgenomic Research date = 2009 keywords = NIAID; gene; genome; sequence summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. doi = 10.1007/978-1-60327-297-1_25 id = cord-325043-vqjhiv7p author = Gorbalenya, Alexander E.
title = An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication date = 1989 keywords = NTP; RNA; protein; sequence summary = doi = 10.1007/bf02102483 id = cord-328259-3g4klpyg author = Guajardo-Leiva, Sergio title = Metagenomic Insights into the Sewage RNA Virosphere of a Large City date = 2020-09-21 keywords = NCBI; RNA; Rotavirus; Trebal; sequence; viral summary = Despite the overrepresentation of dsRNA viruses, our results show that Santiago's sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B) and the Totiviridae family were also highly abundant in treated and untreated sewage samples from the EU [5, 7]. Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments.
doi = 10.3390/v12091050 id = cord-354465-5nqrrnqr author = Haslinger, Christian title = RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties date = 1999 keywords = RNA; graph; secondary; sequence; structure summary = Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992). doi = 10.1006/bulm.1998.0085 id = cord-348427-worgd0xu author = Hatcher, Eneida L. title = Virus Variation Resource – improved response to emergent viral outbreaks date = 2017-01-04 keywords = Resource; Variation; Virus; sequence summary = The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus (Table 2).
New processes have been added to parse source descriptor terms from GenBank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. To resolve this issue, the Virus Variation database loading pipeline parses GenBank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach. doi = 10.1093/nar/gkw1065 id = cord-263987-ff6kor0c author = Holmes, Ian H. title = Solving the master equation for Indels date = 2017-05-12 keywords = Markov; RNA; model; sequence summary = BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees.
CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances. doi = 10.1186/s12859-017-1665-1 id = cord-330067-ujhgb3b0 author = Huang, Yi title = CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes date = 2007-10-02 keywords = SARS; sequence summary = To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as 'Corona_NS3b' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis. doi = 10.1093/nar/gkm754 id = cord-325985-xfzhn1n1 author = Jabado, Omar J.
title = Comprehensive viral oligonucleotide probe design using conserved protein regions date = 2007-12-13 keywords = Pfam; probe; sequence summary = The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31)] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32). Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions. doi = 10.1093/nar/gkm1106 id = cord-017354-cndb031c author = Janies, D. title = Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases date = 2008 keywords = H5N1; SARS; influenza; phylogenetic; sequence; tree summary = The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor and descendant. Phylogenetic analysis of large genomic datasets can present several nested NP-complete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms.
We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92]. Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts. doi = 10.1007/978-3-540-74331-6_2 id = cord-017584-9rx4jlw8 author = Kim, Kwangsoo title = Selecting Genotyping Oligo Probes Via Logical Analysis of Data date = 2007 keywords = probe; sequence summary = Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites. doi = 10.1007/978-3-540-72665-4_8 id = cord-324021-y1vr1db0 author = Kozak, M.
title = Determinants of translational fidelity and efficiency in vertebrate mRNAs date = 1994-12-31 keywords = AUG; codon; sequence; translation summary = doi = 10.1016/0300-9084(94)90182-1 id = cord-353290-1wi1dhv6 author = Kustin, Talia title = Biased mutation and selection in RNA viruses date = 2020-09-28 keywords = Fig; RNA; sequence; virus summary = We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses. doi = 10.1093/molbev/msaa247 id = cord-001340-kqcx7lrq author = Ladner, Jason T. title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date = 2014-06-17 keywords = genome; sequence; viral summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. 
Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/ DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. doi = 10.1128/mbio.01360-14 id = cord-321150-ev6acl7b author = Lam, Ha Minh title = Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm date = 2017-10-03 keywords = sequence; site summary = Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. 
The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values. doi = 10.1093/molbev/msx263 id = cord-025610-7vouj8pp author = Latif, Seemab title = Backward-Forward Sequence Generative Network for Multiple Lexical Constraints date = 2020-05-06 keywords = sequence; word summary = In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13] . Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method to address the issue of lexical constrained sequence generation. 
doi = 10.1007/978-3-030-49186-4_4 id = cord-331698-rwow1ydx author = Latorre-Pérez, Adriel title = A lab in the field: applications of real-time, in situ metagenomic sequencing date = 2020-08-20 keywords = 16S; ONT; dna; metagenomic; sequence; sequencing summary = This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25], inspection of microbial communities in the Arctic [26], DNA sequencing on the International Space Station (ISS) [27], and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29]. In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced.
doi = 10.1093/biomethods/bpaa016 id = cord-252347-vnn4135b author = Lee, Wai-Ming title = A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants date = 2007-10-03 keywords = HRV; P1-P2; PCR; sequence summary = METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5′ noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of the NIm-1A region were not sensitive enough for direct detection of small amounts of HRV in original clinical samples (data not shown), and high-titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify the P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of the 260-bp P1-P2 region of the 5′NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences. doi = 10.1371/journal.pone.0000966 id = cord-338207-60vrlrim author = Lefkowitz, E.J. title = Virus Databases date = 2008-07-30 keywords = NCBI; database; datum; information; sequence summary = (Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data.
While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes. doi = 10.1016/b978-012374410-4.00719-6 id = cord-342785-55r01n0x author = Lemmon, Gordon H title = Predicting the sensitivity and specificity of published real-time PCR assays date = 2008-09-25 keywords = PCR; sequence; signature; time summary = METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. 
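The signature-quality assessment described above rests on counts of predicted true positive, false positive and false negative hits. A minimal sketch of how such counts could be turned into rates follows; the class name, field names and example counts are hypothetical, and this is not the TaqSim tool or the authors' pipeline. Because the summary mentions no true-negative counts, the sketch reports precision rather than specificity in the strict sense.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignatureHits:
    """Predicted hit counts for one primer/probe signature (hypothetical container)."""
    true_positives: int   # target sequences the signature is predicted to match
    false_positives: int  # non-target sequences the signature is predicted to match
    false_negatives: int  # target sequences the signature is predicted to miss

    def sensitivity(self) -> float:
        """Fraction of target sequences detected: TP / (TP + FN)."""
        total_targets = self.true_positives + self.false_negatives
        return self.true_positives / total_targets if total_targets else 0.0

    def precision(self) -> float:
        """Fraction of predicted hits that are real targets: TP / (TP + FP)."""
        total_hits = self.true_positives + self.false_positives
        return self.true_positives / total_hits if total_hits else 0.0


# Made-up counts, for illustration only.
hits = SignatureHits(true_positives=90, false_positives=5, false_negatives=10)
print(f"sensitivity = {hits.sensitivity():.3f}")
print(f"precision   = {hits.precision():.3f}")
```

A signature with many false negatives scores low on sensitivity no matter how few false positives it produces, which is the failure mode the summary attributes to current assay designs.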
Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens doi = 10.1186/1476-0711-7-18 id = cord-321386-u1imic5l author = Li, Chun title = Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation date = 2018-02-17 keywords = Prot; dna; protein; sequence summary = METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a feature vector can be constructed to numerically characterize a protein sequence. By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition doi = 10.2174/1386207321666180130100838 id = cord-306725-0vam15pt author = Li, Hao title = First detection and genomic characteristics of bovine torovirus in dairy calves in China date = 2020-05-09 keywords = China; sequence summary = Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database.
A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2), designated tentatively as group 1 to group 4. Fig. 2: Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. The bovine torovirus strains BToV/SC-1/China and BToV/SC-2/China investigated in this study are indicated by black triangles. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins. doi = 10.1007/s00705-020-04657-9 id = cord-341879-vubszdp2 author = Li, Lucy M title = Genomic analysis of emerging pathogens: methods, application and future trends date = 2014-11-22 keywords = disease; population; sequence summary = In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters.
Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences. doi = 10.1186/s13059-014-0541-9 id = cord-345552-h6fwi0qn author = Li, Q.-G. title = Hydropathic characteristics of adenovirus hexons date = 1997-07-01 keywords = dna; hexon; sequence summary = The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2) . DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein doi = 10.1007/s007050050162 id = cord-001537-i34vmfpp author = Lima, Francisco Esmaile de Sales title = Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil date = 2015-02-17 keywords = Circoviridae; Cyclovirus; dna; sequence summary = The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C ). 
The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2). doi = 10.1371/journal.pone.0118070 id = cord-330312-1pjolkql author = Liu, Y.-T. title = Infectious Disease Genomics date = 2017-01-20 keywords = HGP; genome; human; malaria; sequence summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30–32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
Genome sequence of the human malaria parasite Plasmodium falciparum doi = 10.1016/b978-0-12-799942-5.00010-x id = cord-265857-fs6dj3dp author = Liu, Yu-Tsueng title = Infectious Disease Genomics date = 2010-12-24 keywords = genome; human; sequence summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002). Genome sequencing projects for other important human disease vectors are in progress (Megy et al., 2009). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. doi = 10.1016/b978-0-12-384890-1.00010-8 id = cord-287658-c2lljdi7 author = Lopez-Rincon, Alejandro title = Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning date = 2020-09-10 keywords = CoV-2; RNA; SARS; sequence summary = The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy.
For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g., Real-Time PCR amplicons to identify the SARS-CoV-2 [16]. The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus; for example, in [26] the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID (Table 2) repository. doi = 10.1101/2020.03.13.990242 id = cord-302161-ytr7ds8i author = Lutz, Mirjam title = FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP date = 2020-07-24 keywords = FIP; ORF; sequence; zu1 summary = doi = 10.3390/pathogens9080603 id = cord-025948-6dsx7pey author = Maitra, Arindam title = Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility date = 2020-06-04 keywords = India; RNA; SARS; mutation; sequence summary = Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date.
The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRP and nucleocapsid coding genes. doi = 10.1007/s12038-020-00046-1 id = cord-010161-bcuec2fz author = Matson, David O. title = IV, 6. Calicivirus RNA recombination date = 2004-09-14 keywords = RNA; sequence summary = With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2). While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5′ end of that strain's capsid gene (ID="B" sequence for this Fig.). The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades. doi = 10.1016/s0168-7069(03)09032-3 id = cord-275258-azpg5yrh author = Mead, Dylan J.T.
title = Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date = 2019-07-26 keywords = CNN; model; sequence; table summary = This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level. doi = 10.1016/j.jmgm.2019.07.014 id = cord-027316-echxuw74 author = Modarresi, Kourosh title = Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date = 2020-05-22 keywords = model; sequence summary = This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94] .
Though modified versions of RNNs (recurrent neural networks), such as LSTM and GRU, improve on plain RNNs in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is $\alpha \, (\text{Embedding Error})^{2}$ (6). Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation doi = 10.1007/978-3-030-50420-5_20 id = cord-325750-x7jpsnxg author = Mokili, John L title = Metagenomics and future perspectives in virus discovery date = 2012-01-20 keywords = Koch; dna; figure; metagenomic; sequence; viral; virus summary = doi = 10.1016/j.coviro.2011.12.004 id = cord-000642-mkwpuav6 author = Moreira, Rebeca title = Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing date = 2012-04-19 keywords = Ruditapes; immune; philippinarum; protein; sequence summary = The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes such as apoptosis, the Toll-like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18] .
philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae). doi = 10.1371/journal.pone.0035009 id = cord-311240-o0zyt2vb author = Motayo, Babatunde Olarenwaju title = Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences date = 2020-07-27 keywords = Africa; SARS; sequence summary = Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating, we advocate for up scaling NGS sequencing platforms across Africa to enhance surveillance and aid control effort of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020) . This study was designed to determine to the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against references whole genome sequences of SARS, Recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2 ). doi = 10.1101/2020.07.27.222901 id = cord-018459-isbc1r2o author = Munjal, Geetika title = Phylogenetics Algorithms and Applications date = 2018-12-10 keywords = phylogenetic; sequence summary = This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. 
In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . Constructing phylogenetic trees using multiple sequence alignment doi = 10.1007/978-981-13-5934-7_17 id = cord-264746-gfn312aa author = Muse, Spencer title = GENOMICS AND BIOINFORMATICS date = 2012-03-29 keywords = RNA; dna; figure; gene; genome; sequence summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. 
Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. doi = 10.1016/b978-0-12-238662-6.50015-x id = cord-321762-7kiahjyy author = Nandy, Ashesh title = Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences date = 2015-12-31 keywords = dna; graphical; protein; representation; sequence summary = doi = 10.1016/b978-1-68108-053-6.50005-3 id = cord-326225-crtpzad7 author = Neill, John D. title = Simultaneous rapid sequencing of multiple RNA virus genomes date = 2014-06-01 keywords = RNA; sequence; virus summary = This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation of each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing.
This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2 , library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled. doi = 10.1016/j.jviromet.2014.02.016 id = cord-014461-2ubh9u8r author = Nelson, Oranmiyan W. title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date = 2012-10-10 keywords = Complete; Draft; Genome; Strain; isolate; sequence summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. 
Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042 doi = 10.4056/sigs.3416907 id = cord-016293-pyb00pt5 author = Newell-McGloughlin, Martina title = The flowering of the age of Biotechnology 1990–2000 date = 2006 keywords = FDA; Genome; NIH; RNA; U.S.; University; Venter; cell; disease; dna; gene; human; plant; sequence; technology summary = doi = 10.1007/1-4020-5149-2_4 id = cord-255371-o9oxchq6 author = Nguyen, Thanh Thi title = Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) date = 2020-07-10 keywords = SARS; mutation; protein; sequence summary = This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations occurred across different genes.
In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13] . By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics. doi = 10.1101/2020.07.10.171769 id = cord-012975-u87ol3fs author = Ogiwara, Atsushi title = Construction of a dictionary of sequence motifs that characterize groups of related proteins date = 1992-09-17 keywords = motif; sequence summary = An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. 
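The idea of patterns that are conserved within one functional group and exclusive to it can be illustrated with a deliberately simplified sketch. The assumptions here are not from the source: motifs are treated as exact, fixed-length k-mers, and the function names and toy sequences are hypothetical. The published procedure is not reproduced; the sketch only shows the "present in every member, absent from every non-member" criterion.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exclusive_motifs(group: Iterable[str], others: Iterable[str], k: int) -> Set[str]:
    """k-mers found in every sequence of `group` and in no sequence of `others`."""
    group_sets = [kmers(s, k) for s in group]
    if not group_sets:
        return set()
    conserved = set.intersection(*group_sets)            # shared by all group members
    background = set().union(*(kmers(s, k) for s in others))  # seen anywhere outside the group
    return conserved - background


# Toy sequences, made up for illustration only.
group = ["MKGSHGAL", "TTGSHGQQ"]
others = ["MKALQQTT"]
motifs = exclusive_motifs(group, others, k=3)
print(sorted(motifs))
```

Exact k-mers are a strong simplification: real conserved patterns usually tolerate substitutions and gaps, so a practical dictionary of motifs would need a more permissive pattern language than this sketch uses.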
doi = 10.1093/protein/5.6.479 id = cord-355075-ieb35upi author = Papenfuss, Anthony T title = The immune gene repertoire of an important viral reservoir, the Australian black flying fox date = 2012-06-20 keywords = MHC; RNA; bat; gene; sequence summary = alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response. doi = 10.1186/1471-2164-13-261 id = cord-304607-td0776wj author = Paszkiewicz, Konrad H. title = Omics, Bioinformatics, and Infectious Disease Research date = 2010-12-24 keywords = gene; genome; protein; sequence summary = doi = 10.1016/b978-0-12-384890-1.00018-2 id = cord-264135-s2u76pvk author = Patel, Amrutlal K. title = Complete genome sequence analysis of chicken astrovirus isolate from India date = 2016-12-23 keywords = indian; sequence summary = Phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2) . 
B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate: A total of 9-10 epitopes were predicted with SVMTriP from the capsid protein sequences of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences placed CAstV/INDIA/ANAND/2016 nearest to CAstV/4175 and CAstV/GA2011, and all four chicken astroviruses formed a separate cluster, except that the capsid protein of the CAstV/Poland/G059/2014 isolate clustered with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with doi = 10.1007/s11259-016-9673-6 id = cord-341564-fvuwick5 author = Qi, Zhao-Hui title = Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application date = 2018-06-12 keywords = protein; sequence summary = From these, we can see that physicochemical properties are widely applied in graphical representations of protein sequences by these researchers, and their results appear good. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix.
An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids doi = 10.1177/1176934318777755 id = cord-321715-bkfkmtld author = Redelings, Benjamin D title = Incorporating indel information into phylogeny estimation for rapidly emerging pathogens date = 2007-03-14 keywords = alignment; distribution; indel; model; sequence summary = To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t 1 , ..., t 2N -3 ) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation. 
doi = 10.1186/1471-2148-7-40 id = cord-267500-x3u9i1vq author = Rose, Rebecca title = Challenges in the analysis of viral metagenomes date = 2016-08-03 keywords = Assembly; Bruijn; read; sequence summary = Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of lowfrequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al. 
doi = 10.1093/ve/vew022 id = cord-300149-djclli8n author = Ruan, Yijun title = Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection date = 2003-05-24 keywords = SARS; sequence summary = title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677. doi = 10.1016/s0140-6736(03)13414-9 id = cord-015850-ef6svn8f author = Saitou, Naruya title = Eukaryote Genomes date = 2013-08-22 keywords = RNA; dna; gene; genome; sequence summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. 
Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than the existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81]. doi = 10.1007/978-1-4471-5304-7_8 id = cord-264296-0x90yubt author = Sawmya, Shashata title = Analyzing hCov genome sequences: Applying Machine Intelligence and beyond date = 2020-06-03 keywords = China; Coronavirus; India; sequence summary = We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the SARS-CoV-2 genome sequence followed by further analyses of the same on several South-Asian countries. D.
Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome. doi = 10.1101/2020.06.03.131987 id = cord-268467-btfz6ye8 author = Schreiber, Steven S. title = Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E date = 1989-03-31 keywords = HCV-229E; RNA; sequence summary = The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV-229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3′-end of the genomic RNA or the leader sequence. The 3′-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3′-end (Fig. 4), which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et al., 1983; Lapps et al., 1987; Kamahora et al., 1988; Boursnell et al., 1985) (Table 1).
Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3′ end of the viral mRNA leader sequence doi = 10.1016/0042-6822(89)90050-0 id = cord-010273-0c56x9f5 author = Simmonds, Peter title = Virology of hepatitis C virus date = 2001-10-10 keywords = HCV; RNA; hepatitis; sequence; virus summary = 1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complementary to the HCV genome. 6-13 Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5′NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients. 115 Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection. doi = 10.1016/s0149-2918(96)80193-7 id = cord-213136-euv6pqh5 author = Singh, Kulveer title = Sequence Effects on Internal Structure of Droplets of Associative Polymers date = 2020-05-17 keywords = polymer; sequence summary = We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers.
Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers). doi = nan id = cord-022348-w7z97wir author = Sola, Monica title = Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing? date = 2007-09-02 keywords = HIV-1; RNA; figure; sequence; virus summary = An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6.1). Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6.1 by at most 14%. In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity.
doi = 10.1016/b978-012220360-2/50007-6 id = cord-266960-kyx6xhvj author = Temple, Mark D. title = Real-time audio and visual display of the Coronavirus genome date = 2020-10-02 keywords = RNA; audio; display; sequence summary = The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORF''s, these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example ''Sonification Sub-genomic RNA'' the auditory display represents the process of transcription. 
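The summary above refers to codons derived from all three reading frames of the viral RNA sequence. A minimal Python sketch of that splitting step follows; `codons_by_frame` is a hypothetical helper written for this report, not code from the tool described, and it says nothing about how the codons are subsequently mapped to sound.

```python
def codons_by_frame(rna):
    """Split an RNA string into complete codons for reading frames 0, 1, 2.

    Trailing bases that do not fill a whole codon are dropped.
    """
    frames = {}
    for offset in range(3):
        # Last index (exclusive) that still yields only complete codons.
        usable = len(rna) - (len(rna) - offset) % 3
        frames[offset] = [rna[i:i + 3] for i in range(offset, usable, 3)]
    return frames


example = codons_by_frame("AUGGCUUAA")
for frame, codons in example.items():
    print(frame, codons)
```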
doi = 10.1186/s12859-020-03760-7 id = cord-300807-9u8idlon author = Tong, Joo Chuan title = 7 Infectious disease informatics date = 2013-12-31 keywords = disease; sequence summary = doi = 10.1533/9781908818416.99 id = cord-254942-g51mjj2b author = Touati, Rabeb title = New methodology for repetitive sequences identification in human X and Y chromosomes date = 2020-10-19 keywords = dna; repetitive; sequence summary = doi = 10.1016/j.bspc.2020.102207 id = cord-301827-a7hnuxy5 author = Uversky, Vladimir N title = A decade and a half of protein intrinsic disorder: Biology still waits for physics date = 2013-04-29 keywords = IDPs; bind; disorder; function; interaction; intrinsic; protein; region; sequence; structure summary = 94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns. 
doi = 10.1002/pro.2261 id = cord-339209-oe8onyr9 author = Vasilakis, Nikos title = Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date = 2014-05-20 keywords = RNA; figure; mesoniviruse; sequence; virus summary = The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5''-untranslated region (5''-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail ( Figure 2 ). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CL pro , RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-termimal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain ( Figure 4A, 4D) . 
doi = 10.1186/1743-422x-11-97 id = cord-296691-cg463fbn author = Wang, Ren title = De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing date = 2013-04-09 keywords = Amaryllidaceae; Lycoris; alkaloid; sequence summary = doi = 10.1371/journal.pone.0060449 id = cord-324216-ce3wa889 author = Wang, Zheng title = Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date = 2008-12-01 keywords = HEV; HRV; flu; sequence summary = Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17] . This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17] , developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets in conjunction with previously developed sequence analysis algorithm CIBSI can be easily interpreted to make serotype or strain identifications. doi = 10.1186/1471-2164-9-577 id = cord-022494-d66rz6dc author = Webb, B. 
title = Comparative Modeling of Drug Target Proteins date = 2014-10-01 keywords = comparative; model; modeling; sequence; structure summary = Comparative modeling consists of four main steps 23 (Figure 2 (a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homologyderived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition ( Figure 2(b) ). doi = 10.1016/b978-0-12-409547-2.11133-3 id = cord-311839-61djk4bs author = Wei, Dan title = A novel hierarchical clustering algorithm for gene sequences date = 2012-07-23 keywords = BKM; clustering; dna; sequence summary = We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. 
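For orientation, the k-tuple distance used as the baseline above is sketched below in one common formulation: the squared Euclidean distance between the k-mer count vectors of two sequences. This formulation and the function names are assumptions for illustration; DMk is not reproduced because the summary does not define it.

```python
from collections import Counter


def ktuple_counts(seq, k):
    """Count every overlapping length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=2):
    """Squared Euclidean distance between the k-tuple count vectors of a and b."""
    ca, cb = ktuple_counts(a, k), ktuple_counts(b, k)
    # Counter returns 0 for k-tuples missing from one of the sequences.
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb))


print(ktuple_distance("AAAA", "AAAT"))
```

Because no alignment is computed, the cost grows linearly with sequence length, which is the usual motivation for alignment-free measures of this kind.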
In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1 . doi = 10.1186/1471-2105-13-174 id = cord-343863-q1y8uscj author = Whitney, Joe title = Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches date = 2005-02-08 keywords = blast; sequence summary = ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components ( Figure 1 ): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; (4) and a back-end Java program which runs alignment programs and compiles results in the database. 
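The core comparison described for ReHAB (highlighting hits that appear only when searching the updated database) amounts to a set difference over hit identifiers. The Python sketch below assumes each search result is already reduced to a list of identifiers; the function name and the placeholder identifiers are hypothetical, and ReHAB itself is described above as a Java and MySQL system.

```python
def hits_only_in_update(old_hits, updated_hits):
    """Return identifiers found in updated_hits but not in old_hits,
    keeping the order in which they appear in updated_hits."""
    seen = set(old_hits)
    return [hit for hit in updated_hits if hit not in seen]


# Placeholder identifiers, invented for this example.
previous_search = ["hit_A", "hit_B", "hit_C"]
current_search = ["hit_A", "hit_D", "hit_B", "hit_E"]
print(hits_only_in_update(previous_search, current_search))
```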
doi = 10.1186/1471-2105-6-23 id = cord-103029-nc5yf6x4 author = Wichmann, Stefan title = Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank date = 2020-09-25 keywords = Fig; OLG; SGC; sequence summary = In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occuring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here. doi = 10.1101/2020.09.25.312959 id = cord-103297-4stnx8dw author = Widrich, Michael title = Modern Hopfield Networks and Attention for Immune Repertoire Classification date = 2020-08-17 keywords = CMV; CNN; Hopfield; LSTM; MIL; sequence summary = In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. 
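To make "transformer-like attention" concrete in a multiple instance learning setting, the following pure-Python sketch pools a bag of instance vectors into a single vector with softmax attention weights. It is a generic illustration under stated assumptions (a single fixed query vector, scaled dot-product scores, toy inputs), not the DeepRC implementation.

```python
import math


def attention_pool(query, keys, values):
    """Pool value vectors into one vector using softmax(query . key) weights.

    Returns (pooled_vector, weights). All names here are illustrative.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


pooled, weights = attention_pool(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [2.0, 0.0]],
)
print(pooled, weights)
```

Unlike max-pooling, every instance contributes to the pooled vector in proportion to its weight, so gradients reach all instances in the bag.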
DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016 Demircigil et al., 2017) have an update rule that is known as the attention mechanisms in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results") Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule doi = 10.1101/2020.04.12.038158 id = cord-253436-dz84icdc author = Wille, Michelle title = High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl date = 2016-03-03 keywords = Scaup; sequence summary = In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the sample species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small samples sizes and differences in prevalence, what is clear, is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks are important in the epidemiology of avian CoVs. 
It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30] , and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31] . doi = 10.1371/journal.pone.0150198 id = cord-280881-5o38ihe0 author = Wlodawer, Alexander title = A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date = 2003-11-11 keywords = CLN2; enzyme; sedolisin; sequence summary = These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8] . We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences ( Figure 1 ). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin ( Figure 4 ), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2. 
doi = 10.1186/1472-6807-3-8 id = cord-018963-2lia97db author = Xu, Ying title = Protein Structure Prediction by Protein Threading date = 2010-04-29 keywords = fold; protein; sequence; structure; threading summary = Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now. doi = 10.1007/978-0-387-68825-1_1 id = cord-010499-yefxrj30 author = Yelverton, Elizabeth title = The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli date = 2006-10-27 keywords = Fig; Gallant; HIV; Weiss; sequence summary = Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain "hungry" codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gallant, 1983; 1986; Gallant et al., 1985; Kurland and Gallant, 1986). Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsley, 1992; Peter et al.
coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site. doi = 10.1111/j.1365-2958.1994.tb00310.x id = cord-005060-n901y2d4 author = ZHANG, Feiyun title = Complete Nucleotide Sequence of Ryegrass Mottle Virus: A New Species of the Genus Sobemovirus date = 2001 keywords = ORF; RNA; sequence summary = The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobemoviruses has been determined in Southern bean mosaic virus (SBMV)''2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the conserved sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region: proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3). In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8). doi = 10.1007/pl00012989 id = cord-340907-j9i1wlak author = Zarai, Yoram title = Evolutionary selection against short nucleotide sequences in viruses and their related hosts date = 2020-04-27 keywords = ZIKV; sequence; supplementary; virus summary = Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts.
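To give a sense of what "under-represented" can mean for a short nucleotide sequence, the Python sketch below computes an observed-to-expected ratio for each k-mer, where the expectation assumes independent bases drawn at the sequence's own base frequencies. This is a deliberately simple stand-in chosen for illustration, not the statistical framework of the paper, which the summary indicates also uses random variants of each virus; the function name is hypothetical.

```python
from collections import Counter


def observed_expected_ratios(seq, k):
    """Observed count divided by expected count for each k-mer occurring in seq.

    Expected counts assume independent bases at the frequencies found in seq.
    k-mers that never occur are not listed (their ratio would be 0).
    """
    n_windows = len(seq) - k + 1
    base_freq = {base: n / len(seq) for base, n in Counter(seq).items()}
    observed = Counter(seq[i:i + k] for i in range(n_windows))
    ratios = {}
    for kmer, count in observed.items():
        expected = float(n_windows)
        for base in kmer:
            expected *= base_freq[base]
        ratios[kmer] = count / expected
    return ratios


ratios = observed_expected_ratios("ACGTACGT", 2)
print(sorted(ratios.items(), key=lambda item: item[1]))
```

Ratios well below 1 flag candidate under-represented k-mers; a null model that preserves codon usage or dinucleotide content would be needed before drawing conclusions about selection.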
Figure 3A and B depict the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses. doi = 10.1093/dnares/dsaa008 id = cord-266794-oyppubq5 author = Zhang, Dachuan title = SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model date = 2020-09-01 keywords = sequence summary = In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020), a genome browser (Ham, et al., 2012), and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses.
We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species. doi = 10.1093/bioinformatics/btaa767 id = cord-344782-ond1ziu5 author = Zhang, Jing title = Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi) date = 2018-10-24 keywords = PCR; RNA; River; sequence; virus summary = Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists, and samples were collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals; very high levels of viral RNA were detected in tissues with marked pathological changes; and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue, two of the main affected organs. doi = 10.1371/journal.pone.0205209 id = cord-193910-7p3f3znj author = Zhang, Xiangxie title = Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification date = 2020-11-01 keywords = Levenshtein; dna; feature; sequence summary = In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches.
Finally, three state-of-the-art methods, namely a convolutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table 8, containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken. doi = nan id = cord-031957-df4luh5v author = dos Santos-Silva, Carlos André title = Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era date = 2020-09-02 keywords = amp; antimicrobial; figure; model; peptide; pin; plant; protein; sequence; structure summary = doi = 10.1177/1177932220952739 id = cord-001835-0s7ok4uw author = nan title = Abstracts of the 29th Annual Symposium of The Protein Society date = 2015-10-01 keywords = ATP; Biology; Ca21; Chemistry; Department; Institute; NADPH; NMR; PDB; RNA; Science; Tau; University; activity; base; bind; binding; cell; change; complex; design; dna; domain; enzyme; form; function; high; interaction; membrane; method; molecular; peptide; process; protein; residue; result; role; sequence; site; structure; study summary = Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs.
Non-natural amino acids via the MIO-enzyme toolkit. Alina Filip 1, Judith H Bartha-Vári 1, Gergely Bánóczy 2, László Poppe 2, Csaba Paizs 1, Florin-Dan Irimie 1. 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology. An attractive enzymatic route to the highly valuable, enantiomerically pure α- or β-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues. doi = 10.1002/pro.2823 id = cord-004879-pgyzluwp author = nan title = Programmed cell death date = 1994 keywords = ATP; Basel; Bern; Drosophila; Institut; Lausanne; NMDA; PCR; PKC; RNA; Switzerland; TNF; University; acid; activity; cell; dna; expression; gene; high; human; increase; level; mouse; protein; receptor; result; sequence; study; type summary = Furthermore, kinetic experiments after complementation of HIV-RT p66 with HIV-RT p51 indicated that HIV-RT p51 can restore the rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer p66/p51, suggesting a function of the 51 kDa polypeptide. The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER). We could show that short-term activation (30 mins.)
of c-FosER by estradiol (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins. doi = 10.1007/bf02033112 id = cord-014462-11ggaqf1 author = nan title = Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh date = 2011-04-21 keywords = BTV; CMV; CTV; ELISA; India; PCR; Pradesh; RNA; RTBV; disease; dna; gene; isolate; plant; protein; sequence; study; vaccine; virus summary = Molecular diagnosis based on reverse transcription (RT)-PCR, such as one-step or nested PCR, nucleic acid sequence based amplification (NASBA), or real-time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that the NS1 antigen based ELISA test can be a useful tool to detect dengue virus infection in patients during the early acute phase of disease, since the appearance of IgM antibodies usually occurs after the fifth day of the infection. The studies showed a high level of expression in the case of the constructed vector as compared to infected virus for the specific protein.
doi = 10.1007/s13337-011-0027-2 id = cord-014674-ey29970v author = nan title = Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002: Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 date = 2003 keywords = Gentechnik; dna; sequence summary = We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. We find that, aside from problematic details of the experimental design and some erratic presentations of the data, the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of "criollo" maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig. 2 as regions flanking the CMV p-35S sequence. We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites.
doi = 10.1007/s00103-003-0614-5 id = cord-023208-w99gc5nx author = nan title = Poster Presentation Abstracts date = 2006-09-01 keywords = Fmoc; Gly; HPLC; Lys; NH2; NMR; Pro; RGD; RNA; Tyr; acid; activity; amino; bind; cell; dna; high; interaction; method; peptide; protein; receptor; result; sequence; structure; study summary = In order to develop a synthetic protocol by automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitination is a well-known mechanism of protein degradation in eukaryotic cells, in which many obsolete proteins and proteins with corrupted three-dimensional structure become marked by covalent attachment of ubiquitin through a multi-step enzymatic pathway. Ubiquitin is a small, 8.5 kDa peptide of 76 amino acid residues that targets such substrates for proteolysis in the proteasome. Recent studies showed that an extracellular ubiquitination process, also taking place in the epididymis of humans and other animals, marks proteins on the surface of defective sperm. It appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. This head-to-tail cyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity. As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity.
doi = 10.1002/psc.797 id = cord-023209-un2ysc2v author = nan title = Poster Presentations date = 2008-10-07 keywords = Ala; Arg; Asp; Fmoc; Glu; Gly; HPLC; Leu; Lys; NMR; Phe; Thr; Trp; Tyr; University; VEGF; Val; acid; activity; amino; bind; cell; dna; high; peptide; pro; protein; receptor; residue; result; sequence; structure; study; synthesis summary = Site-specific PEGylation of human IgG1-Fab using a rationally designed trypsin variant. In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifications of proteins with artificial functionalities under native state conditions. Recently, our group reported a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes. In the present study, peptide epitopes of the VP1 protein, both the 135-161 amino acid residues (P1) (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and the 135-161 amino acid residues containing tryptophan (Trp) on the N terminus (Trp-135-161) (P2), were synthesized by using microwave-assisted solid-phase methods. Using as a template a peptide, already identified, with agonist activity against PTPRJ (H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifications (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfide bridge) and the preliminary biological results of this set of compounds.
doi = 10.1002/psc.1090 id = cord-023647-dlqs8ay9 author = nan title = Sequences and topology date = 2003-03-21 keywords = Evolution; Family; Gene; Human; Protein; acid; sequence summary = Nucleotide Sequence Analysis of the L Gene of Vesicular Stomatitis Virus (New Jersey Serotype): Identification of Conserved Domains in L Proteins of Nonsegmented Negative-Strand RNA Viruses. DERSE, Equine Infectious Anemia Virus tat: Insights into the Structure, Function, and Evolution of Lentivirus trans-Activator Proteins. S71 is a Phylogenetically Distinct Human Endogenous Retroviral Element with Structural and Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Ferredoxins from Rhodobacter capsulatus: Complete Amino Acid Sequences and Molecular Evolution. Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Among the α-subunits and Molecular [...]. The Predicted Amino Acid Sequence of α-Internexin is that of a novel Neuronal Intermediate Filament Protein. Intraspecific Evolution of a Gene Family Coding for Urinary Proteins. Analysis of cDNA for Human Erythrocyte Ankyrin Indicates a Repeated Structure with Homology to Tissue-Differentiation and Cell-Cycle Control Protein. doi = 10.1016/0959-440x(91)90051-t id = cord-300796-rmjv56ia author = nan title = The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation date = 1990-09-01 keywords = Fig; p62; protein; sequence summary = In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane, our results suggest that this region has no role in completing the transfer process.
Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn13 of the p62 sequence if the 40-residue-long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn13 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980; references on dhfr sequence in legend to Fig. 1). Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain. doi = nan id = cord-256608-ajzk86rq author = van Weezep, Erik title = PCR diagnostics: In silico validation by an automated tool using freely available software programs date = 2019-05-13 keywords = PCR; sequence; silico summary = An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017). The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe, as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001). doi = 10.1016/j.jviromet.2019.05.002
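The "maximum of one mismatch per primer or probe" criterion in the last summary above can be illustrated with a minimal Python sketch. It is not the PCRv tool and does not use SSEARCH: as a simplifying assumption it substitutes an ungapped sliding-window comparison for the alignment search, and it ignores reverse complements and degenerate bases. The function names (`min_mismatches`, `is_detected`, `percent_detected`) and the toy sequences are hypothetical.

```python
from typing import Sequence


def min_mismatches(oligo: str, target: str) -> int:
    """Fewest mismatches of `oligo` against any same-length window of `target`.

    Assumption: ungapped comparison only (a stand-in for a real alignment search).
    """
    n = len(oligo)
    if len(target) < n:
        return n  # oligo cannot fit: count every position as a mismatch
    return min(
        sum(a != b for a, b in zip(oligo, target[i:i + n]))
        for i in range(len(target) - n + 1)
    )


def is_detected(oligos: Sequence[str], target: str, max_mismatches: int = 1) -> bool:
    """A target counts as detected if every primer/probe has at most `max_mismatches`."""
    return all(min_mismatches(o, target) <= max_mismatches for o in oligos)


def percent_detected(oligos: Sequence[str], targets: Sequence[str],
                     max_mismatches: int = 1) -> float:
    """Percentage of target sequences considered detected by the oligo set."""
    if not targets:
        return 0.0
    hits = sum(is_detected(oligos, t, max_mismatches) for t in targets)
    return 100.0 * hits / len(targets)


# Toy data (hypothetical): two short "oligos" and four candidate sequences.
oligos = ["ACGT", "GGCC"]
targets = [
    "TTACGTAAGGCCTT",  # both oligos match exactly
    "TTACCTAAGGCATT",  # one mismatch in each oligo
    "TTAGGTAAGGAATT",  # second oligo needs two mismatches
    "AAAAAAAAAAAAAA",  # no plausible binding site
]
print(percent_detected(oligos, targets))  # 50.0
```

Applied to the sequences of the intended taxon, such a percentage would correspond to the in silico measure described in the summary; a real validation would rely on a proper alignment search rather than this window scan.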