Carrel name: keyword-sequence-cord
Creating study carrel named keyword-sequence-cord
Initializing database
file: cache/cord-000257-ampip7od.json key: cord-000257-ampip7od authors: Bagowski, Christoph P; Bruins, Wouter; te Velthuis, Aartjan J.W title: The Nature of Protein Domain Evolution: Shaping the Interaction Network date: 2010-08-17 journal: Curr Genomics DOI: 10.2174/138920210791616725 sha: doc_id: 257 cord_uid: ampip7od
file: cache/cord-016293-pyb00pt5.json key: cord-016293-pyb00pt5 authors: Newell-McGloughlin, Martina; Re, Edward title: The flowering of the age of Biotechnology 1990–2000 date: 2006 journal: The Evolution of Biotechnology DOI: 10.1007/1-4020-5149-2_4 sha: doc_id: 16293 cord_uid: pyb00pt5
file: cache/cord-016798-tv2ntug6.json key: cord-016798-tv2ntug6 authors: Gautam, Ablesh; Tiwari, Ashish; Malik, Yashpal Singh title: Bioinformatics Applications in Advancing Animal Virus Research date: 2019-06-06 journal: Recent Advances in Animal Virology DOI: 10.1007/978-981-13-9073-9_23 sha: doc_id: 16798 cord_uid: tv2ntug6
file: cache/cord-000473-jpow6iw1.json key: cord-000473-jpow6iw1 authors: Astrovskaya, Irina; Tork, Bassam; Mangul, Serghei; Westbrooks, Kelly; Măndoiu, Ion; Balfe, Peter; Zelikovsky, Alex title: Inferring viral quasispecies spectra from 454 pyrosequencing reads date: 2011-07-28 journal: BMC Bioinformatics DOI: 10.1186/1471-2105-12-s6-s1 sha: doc_id: 473 cord_uid: jpow6iw1
file: cache/cord-025610-7vouj8pp.json key: cord-025610-7vouj8pp authors: Latif, Seemab; Bashir, Sarmad; Agha, Mir Muntasar Ali; Latif, Rabia title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints date: 2020-05-06 journal: Artificial Intelligence Applications and Innovations DOI: 10.1007/978-3-030-49186-4_4 sha: doc_id: 25610 cord_uid: 7vouj8pp
file: cache/cord-004862-yv76yvy5.json key: cord-004862-yv76yvy5 authors: Demers, G. William; Matunis, Michael J.; Hardison, Ross C.
title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date: 1989 journal: J Mol Evol DOI: 10.1007/bf02106177 sha: doc_id: 4862 cord_uid: yv76yvy5
file: cache/cord-025948-6dsx7pey.json key: cord-025948-6dsx7pey authors: Maitra, Arindam; Sarkar, Mamta Chawla; Raheja, Harsha; Biswas, Nidhan K; Chakraborti, Sohini; Singh, Animesh Kumar; Ghosh, Shekhar; Sarkar, Sumanta; Patra, Subrata; Mondal, Rajiv Kumar; Ghosh, Trinath; Chatterjee, Ananya; Banu, Hasina; Majumdar, Agniva; Chinnaswamy, Sreedhar; Srinivasan, Narayanaswamy; Dutta, Shanta; Das, Saumitra title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility date: 2020-06-04 journal: J Biosci DOI: 10.1007/s12038-020-00046-1 sha: doc_id: 25948 cord_uid: 6dsx7pey
file: cache/cord-014674-ey29970v.json key: cord-014674-ey29970v authors: nan title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 date: 2003 journal: Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz DOI: 10.1007/s00103-003-0614-5 sha: doc_id: 14674 cord_uid: ey29970v
file: cache/cord-018459-isbc1r2o.json key: cord-018459-isbc1r2o authors: Munjal, Geetika; Hanmandlu, Madasu; Srivastava, Sangeet title: Phylogenetics Algorithms and Applications date: 2018-12-10 journal: Ambient Communications and Computer Systems DOI: 10.1007/978-981-13-5934-7_17 sha: doc_id: 18459 cord_uid: isbc1r2o
file: cache/cord-015850-ef6svn8f.json key: cord-015850-ef6svn8f authors: Saitou, Naruya title: Eukaryote Genomes date: 2013-08-22 journal: Introduction to Evolutionary Genomics DOI: 10.1007/978-1-4471-5304-7_8 sha: doc_id: 15850 cord_uid: ef6svn8f
file: cache/cord-012975-u87ol3fs.json key:
cord-012975-u87ol3fs authors: Ogiwara, Atsushi; Uchiyama, Ikuo; Seto, Yasuhiko; Kanehisa, Minoru title: Construction of a dictionary of sequence motifs that characterize groups of related proteins date: 1992-09-17 journal: Protein Eng DOI: 10.1093/protein/5.6.479 sha: doc_id: 12975 cord_uid: u87ol3fs
file: cache/cord-033010-o5kiadfm.json key: cord-033010-o5kiadfm authors: Durojaye, Olanrewaju Ayodeji; Mushiana, Talifhani; Uzoeto, Henrietta Onyinye; Cosmas, Samuel; Udowo, Victor Malachy; Osotuyi, Abayomi Gaius; Ibiang, Glory Omini; Gonlepa, Miapeh Kous title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date: 2020-10-02 journal: Egypt J Med Hum Genet DOI: 10.1186/s43042-020-00081-5 sha: doc_id: 33010 cord_uid: o5kiadfm
file: cache/cord-256608-ajzk86rq.json key: cord-256608-ajzk86rq authors: van Weezep, Erik; Kooi, Engbert A.; van Rijn, Piet A. title: PCR diagnostics: In silico validation by an automated tool using freely available software programs date: 2019-05-13 journal: J Virol Methods DOI: 10.1016/j.jviromet.2019.05.002 sha: doc_id: 256608 cord_uid: ajzk86rq
file: cache/cord-103029-nc5yf6x4.json key: cord-103029-nc5yf6x4 authors: Wichmann, Stefan; Scherer, Siegfried; Ardern, Zachary title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank date: 2020-09-25 journal: bioRxiv DOI: 10.1101/2020.09.25.312959 sha: doc_id: 103029 cord_uid: nc5yf6x4
file: cache/cord-001340-kqcx7lrq.json key: cord-001340-kqcx7lrq authors: Ladner, Jason T.; Beitzel, Brett; Chain, Patrick S.
G.; Davenport, Matthew G.; Donaldson, Eric; Frieman, Matthew; Kugelman, Jeffrey; Kuhn, Jens H.; O’Rear, Jules; Sabeti, Pardis C.; Wentworth, David E.; Wiley, Michael R.; Yu, Guo-Yun; Sozhamannan, Shanmuga; Bradburne, Christopher; Palacios, Gustavo title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date: 2014-06-17 journal: mBio DOI: 10.1128/mbio.01360-14 sha: doc_id: 1340 cord_uid: kqcx7lrq
file: cache/cord-002473-2kpxhzbe.json key: cord-002473-2kpxhzbe authors: Das, Jayanta Kumar; Pal Choudhury, Pabitra title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach date: 2017-03-31 journal: PLoS One DOI: 10.1371/journal.pone.0175031 sha: doc_id: 2473 cord_uid: 2kpxhzbe
file: cache/cord-010260-8lnpujip.json key: cord-010260-8lnpujip authors: Anthonsen, Henrik W.; Baptista, António; Drabløs, Finn; Martel, Paulo; Petersen, Steffen B. title: The blind watchmaker and rational protein engineering date: 1994-08-31 journal: J Biotechnol DOI: 10.1016/0168-1656(94)90152-x sha: doc_id: 10260 cord_uid: 8lnpujip
file: cache/cord-010161-bcuec2fz.json key: cord-010161-bcuec2fz authors: Matson, David O. title: IV, 6.
Calicivirus RNA recombination date: 2004-09-14 journal: Perspect Med Virol DOI: 10.1016/s0168-7069(03)09032-3 sha: doc_id: 10161 cord_uid: bcuec2fz
file: cache/cord-017584-9rx4jlw8.json key: cord-017584-9rx4jlw8 authors: Kim, Kwangsoo; Ryoo, Hong Seo title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data date: 2007 journal: Advances in Artificial Intelligence DOI: 10.1007/978-3-540-72665-4_8 sha: doc_id: 17584 cord_uid: 9rx4jlw8
file: cache/cord-005060-n901y2d4.json key: cord-005060-n901y2d4 authors: ZHANG, Feiyun; TORIYAMA, Shigemitsu; TAKAHASHI, Mami title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus date: 2001 journal: J DOI: 10.1007/pl00012989 sha: doc_id: 5060 cord_uid: n901y2d4
file: cache/cord-011565-8ncgldaq.json key: cord-011565-8ncgldaq authors: Elworth, R A Leo; Wang, Qi; Kota, Pavan K; Barberan, C J; Coleman, Benjamin; Balaji, Advait; Gupta, Gaurav; Baraniuk, Richard G; Shrivastava, Anshumali; Treangen, Todd J title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: 2020-06-04 journal: Nucleic Acids Res DOI: 10.1093/nar/gkaa265 sha: doc_id: 11565 cord_uid: 8ncgldaq
file: cache/cord-001537-i34vmfpp.json key: cord-001537-i34vmfpp authors: Lima, Francisco Esmaile de Sales; Cibulski, Samuel Paulo; dos Santos, Helton Fernandes; Teixeira, Thais Fumaco; Varela, Ana Paula Muterle; Roehe, Paulo Michel; Delwart, Eric; Franco, Ana Cláudia title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil date: 2015-02-17 journal: PLoS One DOI: 10.1371/journal.pone.0118070 sha: doc_id: 1537 cord_uid: i34vmfpp
file: cache/cord-256278-jvfjf7aw.json key: cord-256278-jvfjf7aw authors: Feng, Jie; Hu, Yong; Wan, Ping; Zhang, Aibing; Zhao, Weizhong title: New method for comparing DNA primary sequences based on a discrimination measure date: 2010-10-21 journal: Journal of Theoretical
Biology DOI: 10.1016/j.jtbi.2010.07.040 sha: doc_id: 256278 cord_uid: jvfjf7aw
file: cache/cord-000642-mkwpuav6.json key: cord-000642-mkwpuav6 authors: Moreira, Rebeca; Balseiro, Pablo; Planas, Josep V.; Fuste, Berta; Beltran, Sergi; Novoa, Beatriz; Figueras, Antonio title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing date: 2012-04-19 journal: PLoS One DOI: 10.1371/journal.pone.0035009 sha: doc_id: 642 cord_uid: mkwpuav6
file: cache/cord-255194-4i9fc0r7.json key: cord-255194-4i9fc0r7 authors: Djikeng, Appolinaire; Halpin, Rebecca; Kuzmickas, Ryan; DePasse, Jay; Feldblyum, Jeremy; Sengamalay, Naomi; Afonso, Claudio; Zhang, Xinsheng; Anderson, Norman G; Ghedin, Elodie; Spiro, David J title: Viral genome sequencing by random priming methods date: 2008-01-07 journal: BMC Genomics DOI: 10.1186/1471-2164-9-5 sha: doc_id: 255194 cord_uid: 4i9fc0r7
file: cache/cord-016594-lj0us1dq.json key: cord-016594-lj0us1dq authors: Flower, Darren R.; Davies, Matthew N.; Doytchinova, Irini A. title: Identification of Candidate Vaccine Antigens In Silico date: 2012-09-28 journal: Immunomic Discovery of Adjuvants and Candidate Subunit Vaccines DOI: 10.1007/978-1-4614-5070-2_3 sha: doc_id: 16594 cord_uid: lj0us1dq
file: cache/cord-023647-dlqs8ay9.json key: cord-023647-dlqs8ay9 authors: nan title: Sequences and topology date: 2003-03-21 journal: Curr Opin Struct Biol DOI: 10.1016/0959-440x(91)90051-t sha: doc_id: 23647 cord_uid: dlqs8ay9
file: cache/cord-022348-w7z97wir.json key: cord-022348-w7z97wir authors: Sola, Monica; Wain-Hobson, Simon title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
date: 2007-09-02 journal: Origin and Evolution of Viruses DOI: 10.1016/b978-012220360-2/50007-6 sha: doc_id: 22348 cord_uid: w7z97wir
file: cache/cord-264296-0x90yubt.json key: cord-264296-0x90yubt authors: Sawmya, Shashata; Saha, Arpita; Tasnim, Sadia; Anjum, Naser; Toufikuzzaman, Md.; Rafid, Ali Haisam Muhammad; Rahman, Mohammad Saifur; Rahman, M. Sohel title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond date: 2020-06-03 journal: bioRxiv DOI: 10.1101/2020.06.03.131987 sha: doc_id: 264296 cord_uid: 0x90yubt
file: cache/cord-035033-osjy88rc.json key: cord-035033-osjy88rc authors: Aydin, Berkay; Boubrahimi, Soukaina Filali; Kucuk, Ahmet; Nezamdoust, Bita; Angryk, Rafal A. title: Spatiotemporal event sequence discovery without thresholds date: 2020-11-09 journal: Geoinformatica DOI: 10.1007/s10707-020-00427-6 sha: doc_id: 35033 cord_uid: osjy88rc
file: cache/cord-203232-1nnqx1g9.json key: cord-203232-1nnqx1g9 authors: Canturk, Semih; Singh, Aman; St-Amant, Patrick; Behrmann, Jason title: Machine-Learning Driven Drug Repurposing for COVID-19 date: 2020-06-25 journal: nan DOI: nan sha: doc_id: 203232 cord_uid: 1nnqx1g9
file: cache/cord-264135-s2u76pvk.json key: cord-264135-s2u76pvk authors: Patel, Amrutlal K.; Pandit, Ramesh J.; Thakkar, Jalpa R.; Hinsu, Ankit T.; Pandey, Vinod C.; Pal, Joy K.; Prajapati, Kantilal S.; Jakhesara, Subhash J.; Joshi, Chaitanya G. title: Complete genome sequence analysis of chicken astrovirus isolate from India date: 2016-12-23 journal: Vet Res Commun DOI: 10.1007/s11259-016-9673-6 sha: doc_id: 264135 cord_uid: s2u76pvk
file: cache/cord-266288-buc4dd5y.json key: cord-266288-buc4dd5y authors: Dong, Rui; He, Lily; He, Rong Lucy; Yau, Stephen S.-T.
title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance date: 2019-04-09 journal: Front Genet DOI: 10.3389/fgene.2019.00234 sha: doc_id: 266288 cord_uid: buc4dd5y
file: cache/cord-001786-ybd8hi8y.json key: cord-001786-ybd8hi8y authors: Dutilh, Bas E title: Metagenomic ventures into outer sequence space date: 2014-12-15 journal: Bacteriophage DOI: 10.4161/21597081.2014.979664 sha: doc_id: 1786 cord_uid: ybd8hi8y
file: cache/cord-018133-2otxft31.json key: cord-018133-2otxft31 authors: Altman, Russ B.; Mooney, Sean D. title: Bioinformatics date: 2006 journal: Biomedical Informatics DOI: 10.1007/0-387-36278-9_22 sha: doc_id: 18133 cord_uid: 2otxft31
file: cache/cord-266960-kyx6xhvj.json key: cord-266960-kyx6xhvj authors: Temple, Mark D. title: Real-time audio and visual display of the Coronavirus genome date: 2020-10-02 journal: BMC Bioinformatics DOI: 10.1186/s12859-020-03760-7 sha: doc_id: 266960 cord_uid: kyx6xhvj
file: cache/cord-003316-r5te5xob.json key: cord-003316-r5te5xob authors: Balloux, Francois; Brønstad Brynildsrud, Ola; van Dorp, Lucy; Shaw, Liam P.; Chen, Hongbin; Harris, Kathryn A.; Wang, Hui; Eldholm, Vegard title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic date: 2018-12-17 journal: Trends Microbiol DOI: 10.1016/j.tim.2018.08.004 sha: doc_id: 3316 cord_uid: r5te5xob
file: cache/cord-300796-rmjv56ia.json key: cord-300796-rmjv56ia authors: nan title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation date: 1990-09-01 journal: J Cell Biol DOI: nan sha: doc_id: 300796 cord_uid: rmjv56ia
file: cache/cord-017932-vmtjc8ct.json key: cord-017932-vmtjc8ct authors: Georgiev, Vassil St.
title: Genomic and Postgenomic Research date: 2009 journal: National Institute of Allergy and Infectious Diseases, NIH DOI: 10.1007/978-1-60327-297-1_25 sha: doc_id: 17932 cord_uid: vmtjc8ct
file: cache/cord-265857-fs6dj3dp.json key: cord-265857-fs6dj3dp authors: Liu, Yu-Tsueng title: Infectious Disease Genomics date: 2010-12-24 journal: Genetics and Evolution of Infectious Disease DOI: 10.1016/b978-0-12-384890-1.00010-8 sha: doc_id: 265857 cord_uid: fs6dj3dp
file: cache/cord-010273-0c56x9f5.json key: cord-010273-0c56x9f5 authors: Simmonds, Peter title: Virology of hepatitis C virus date: 2001-10-10 journal: Clin Ther DOI: 10.1016/s0149-2918(96)80193-7 sha: doc_id: 10273 cord_uid: 0c56x9f5
file: cache/cord-010499-yefxrj30.json key: cord-010499-yefxrj30 authors: Yelverton, Elizabeth; Lindsley, Dale; Yamauchi, Phil; Gallant, Jonathan A. title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli date: 2006-10-27 journal: Mol Microbiol DOI: 10.1111/j.1365-2958.1994.tb00310.x sha: doc_id: 10499 cord_uid: yefxrj30
file: cache/cord-263987-ff6kor0c.json key: cord-263987-ff6kor0c authors: Holmes, Ian H. title: Solving the master equation for Indels date: 2017-05-12 journal: BMC Bioinformatics DOI: 10.1186/s12859-017-1665-1 sha: doc_id: 263987 cord_uid: ff6kor0c
file: cache/cord-022494-d66rz6dc.json key: cord-022494-d66rz6dc authors: Webb, B.; Eswar, N.; Fan, H.; Khuri, N.; Pieper, U.; Dong, G.Q.; Sali, A.
title: Comparative Modeling of Drug Target Proteins date: 2014-10-01 journal: Reference Module in Chemistry, Molecular Sciences and Chemical Engineering DOI: 10.1016/b978-0-12-409547-2.11133-3 sha: doc_id: 22494 cord_uid: d66rz6dc
file: cache/cord-193910-7p3f3znj.json key: cord-193910-7p3f3znj authors: Zhang, Xiangxie; Beinke, Ben; Kindhi, Berlian Al; Wiering, Marco title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification date: 2020-11-01 journal: nan DOI: nan sha: doc_id: 193910 cord_uid: 7p3f3znj
file: cache/cord-253436-dz84icdc.json key: cord-253436-dz84icdc authors: Wille, Michelle; Muradrasoli, Shaman; Nilsson, Anna; Järhult, Josef D. title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl date: 2016-03-03 journal: PLoS One DOI: 10.1371/journal.pone.0150198 sha: doc_id: 253436 cord_uid: dz84icdc
file: cache/cord-255371-o9oxchq6.json key: cord-255371-o9oxchq6 authors: Nguyen, Thanh Thi; Pathirana, Pubudu N.; Nguyen, Thin; Nguyen, Henry; Bhatti, Asim; Nguyen, Dinh C.; Nguyen, Dung Tien; Nguyen, Ngoc Duy; Creighton, Douglas; Abdelrazek, Mohamed title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) date: 2020-07-10 journal: bioRxiv DOI: 10.1101/2020.07.10.171769 sha: doc_id: 255371 cord_uid: o9oxchq6
file: cache/cord-017354-cndb031c.json key: cord-017354-cndb031c authors: Janies, D.; Pol, D. title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases date: 2008 journal: Tutorials in Mathematical Biosciences IV DOI: 10.1007/978-3-540-74331-6_2 sha: doc_id: 17354 cord_uid: cndb031c
file: cache/cord-014461-2ubh9u8r.json key: cord-014461-2ubh9u8r authors: Nelson, Oranmiyan W.; Garrity, George M.
title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date: 2012-10-10 journal: Stand Genomic Sci DOI: 10.4056/sigs.3416907 sha: doc_id: 14461 cord_uid: 2ubh9u8r
file: cache/cord-014462-11ggaqf1.json key: cord-014462-11ggaqf1 authors: nan title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh date: 2011-04-21 journal: Indian J Virol DOI: 10.1007/s13337-011-0027-2 sha: doc_id: 14462 cord_uid: 11ggaqf1
file: cache/cord-268549-2lg8i9r1.json key: cord-268549-2lg8i9r1 authors: Dai, Qi; Guo, Xiaodong; Li, Lihua title: Sequence comparison via polar coordinates representation and curve tree date: 2012-01-07 journal: Journal of Theoretical Biology DOI: 10.1016/j.jtbi.2011.09.030 sha: doc_id: 268549 cord_uid: 2lg8i9r1
file: cache/cord-001974-wjf3c7a7.json key: cord-001974-wjf3c7a7 authors: Friis-Nielsen, Jens; Kjartansdóttir, Kristín Rós; Mollerup, Sarah; Asplund, Maria; Mourier, Tobias; Jensen, Randi Holm; Hansen, Thomas Arn; Rey-Iglesia, Alba; Richter, Stine Raith; Nielsen, Ida Broman; Alquezar-Planas, David E.; Olsen, Pernille V. S.; Vinner, Lasse; Fridholm, Helena; Nielsen, Lars Peter; Willerslev, Eske; Sicheritz-Pontén, Thomas; Lund, Ole; Hansen, Anders Johannes; Izarzugaza, Jose M.
G.; Brunak, Søren title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date: 2016-02-19 journal: Viruses DOI: 10.3390/v8020053 sha: doc_id: 1974 cord_uid: wjf3c7a7
file: cache/cord-275258-azpg5yrh.json key: cord-275258-azpg5yrh authors: Mead, Dylan J.T.; Lunagomez, Simón; Gatherer, Derek title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date: 2019-07-26 journal: J Mol Graph Model DOI: 10.1016/j.jmgm.2019.07.014 sha: doc_id: 275258 cord_uid: azpg5yrh
file: cache/cord-321386-u1imic5l.json key: cord-321386-u1imic5l authors: Li, Chun; Zhao, Jialing; Wang, Changzhong; Yao, Yuhua title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation date: 2018-02-17 journal: Comb Chem High Throughput Screen DOI: 10.2174/1386207321666180130100838 sha: doc_id: 321386 cord_uid: u1imic5l
file: cache/cord-023208-w99gc5nx.json key: cord-023208-w99gc5nx authors: nan title: Poster Presentation Abstracts date: 2006-09-01 journal: J Pept Sci DOI: 10.1002/psc.797 sha: doc_id: 23208 cord_uid: w99gc5nx
file: cache/cord-306725-0vam15pt.json key: cord-306725-0vam15pt authors: Li, Hao; Zhang, Bin; Yue, Hua; Tang, Cheng title: First detection and genomic characteristics of bovine torovirus in dairy calves in China date: 2020-05-09 journal: Arch Virol DOI: 10.1007/s00705-020-04657-9 sha: doc_id: 306725 cord_uid: 0vam15pt
file: cache/cord-027316-echxuw74.json key: cord-027316-echxuw74 authors: Modarresi, Kourosh title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date: 2020-05-22 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50420-5_20 sha: doc_id: 27316 cord_uid: echxuw74
file: cache/cord-213136-euv6pqh5.json key: cord-213136-euv6pqh5 authors: Singh, Kulveer; Rabin, Yitzhak title: Sequence Effects
on Internal Structure of Droplets of Associative Polymers date: 2020-05-17 journal: nan DOI: nan sha: doc_id: 213136 cord_uid: euv6pqh5
file: cache/cord-103297-4stnx8dw.json key: cord-103297-4stnx8dw authors: Widrich, Michael; Schäfl, Bernhard; Pavlović, Milena; Ramsauer, Hubert; Gruber, Lukas; Holzleitner, Markus; Brandstetter, Johannes; Sandve, Geir Kjetil; Greiff, Victor; Hochreiter, Sepp; Klambauer, Günter title: Modern Hopfield Networks and Attention for Immune Repertoire Classification date: 2020-08-17 journal: bioRxiv DOI: 10.1101/2020.04.12.038158 sha: doc_id: 103297 cord_uid: 4stnx8dw
key: cord-193356-hqbstgg7 authors: Widrich, Michael; Schafl, Bernhard; Ramsauer, Hubert; Pavlovi'c, Milena; Gruber, Lukas; Holzleitner, Markus; Brandstetter, Johannes; Sandve, Geir Kjetil; Greiff, Victor; Hochreiter, Sepp; Klambauer, Gunter title: Modern Hopfield Networks and Attention for Immune Repertoire Classification date: 2020-07-16 journal: nan DOI: nan sha: doc_id: 193356 cord_uid: hqbstgg7
file: cache/cord-252347-vnn4135b.json key: cord-252347-vnn4135b authors: Lee, Wai-Ming; Kiesner, Christin; Pappas, Tressa; Lee, Iris; Grindle, Kris; Jartti, Tuomas; Jakiela, Bogdan; Lemanske, Robert F.; Shult, Peter A.; Gern, James E.
title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants date: 2007-10-03 journal: PLoS One DOI: 10.1371/journal.pone.0000966 sha: doc_id: 252347 cord_uid: vnn4135b
file: cache/cord-031957-df4luh5v.json key: cord-031957-df4luh5v authors: dos Santos-Silva, Carlos André; Zupin, Luisa; Oliveira-Lima, Marx; Vilela, Lívia Maria Batista; Bezerra-Neto, João Pacifico; Ferreira-Neto, José Ribamar; Ferreira, José Diogo Cavalcanti; de Oliveira-Silva, Roberta Lane; Pires, Carolline de Jesús; Aburjaile, Flavia Figueira; de Oliveira, Marianne Firmino; Kido, Ederson Akio; Crovella, Sergio; Benko-Iseppon, Ana Maria title: Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era date: 2020-09-02 journal: Bioinform Biol Insights DOI: 10.1177/1177932220952739 sha: doc_id: 31957 cord_uid: df4luh5v
file: cache/cord-264746-gfn312aa.json key: cord-264746-gfn312aa authors: Muse, Spencer title: GENOMICS AND BIOINFORMATICS date: 2012-03-29 journal: Introduction to Biomedical Engineering DOI: 10.1016/b978-0-12-238662-6.50015-x sha: doc_id: 264746 cord_uid: gfn312aa
file: cache/cord-267500-x3u9i1vq.json key: cord-267500-x3u9i1vq authors: Rose, Rebecca; Constantinides, Bede; Tapinos, Avraam; Robertson, David L; Prosperi, Mattia title: Challenges in the analysis of viral metagenomes date: 2016-08-03 journal: Virus Evol DOI: 10.1093/ve/vew022 sha: doc_id: 267500 cord_uid: x3u9i1vq
file: cache/cord-311240-o0zyt2vb.json key: cord-311240-o0zyt2vb authors: Motayo, Babatunde Olarenwaju; Oluwasemowo, Olukunle Oluwapamilerin; Akinduti, Paul Akiniyi; Olusola, Babatunde Adebiyi; Aerege, Olumide T; Faneye, Adedayo Omotayo title: Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences date: 2020-07-27 journal: bioRxiv DOI: 10.1101/2020.07.27.222901 sha: doc_id: 311240 cord_uid: o0zyt2vb
file: cache/cord-321715-bkfkmtld.json key: cord-321715-bkfkmtld authors: Redelings,
Benjamin D; Suchard, Marc A title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens date: 2007-03-14 journal: BMC Evol Biol DOI: 10.1186/1471-2148-7-40 sha: doc_id: 321715 cord_uid: bkfkmtld
file: cache/cord-311839-61djk4bs.json key: cord-311839-61djk4bs authors: Wei, Dan; Jiang, Qingshan; Wei, Yanjie; Wang, Shengrui title: A novel hierarchical clustering algorithm for gene sequences date: 2012-07-23 journal: BMC Bioinformatics DOI: 10.1186/1471-2105-13-174 sha: doc_id: 311839 cord_uid: 61djk4bs
file: cache/cord-018963-2lia97db.json key: cord-018963-2lia97db authors: Xu, Ying; Liu, Zhijie; Cai, Liming; Xu, Dong title: Protein Structure Prediction by Protein Threading date: 2010-04-29 journal: Computational Methods for Protein Structure Prediction and Modeling DOI: 10.1007/978-0-387-68825-1_1 sha: doc_id: 18963 cord_uid: 2lia97db
file: cache/cord-321762-7kiahjyy.json key: cord-321762-7kiahjyy authors: Nandy, Ashesh title: Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences date: 2015-12-31 journal: Advances in Mathematical Chemistry and Applications DOI: 10.1016/b978-1-68108-053-6.50005-3 sha: doc_id: 321762 cord_uid: 7kiahjyy
file: cache/cord-102766-n6mpdhyu.json key: cord-102766-n6mpdhyu authors: Alam, Md.
Nafis Ul; Chowdhury, Umar Faruq title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date: 2020-06-25 journal: bioRxiv DOI: 10.1101/2020.06.25.170779 sha: doc_id: 102766 cord_uid: n6mpdhyu
file: cache/cord-254942-g51mjj2b.json key: cord-254942-g51mjj2b authors: Touati, Rabeb; Tajouri, Asma; Mesaoudi, Imen; Oueslati, Afef Elloumi; Lachiri, Zied; Kharrat, Maher title: New methodology for repetitive sequences identification in human X and Y chromosomes date: 2020-10-19 journal: Biomed Signal Process Control DOI: 10.1016/j.bspc.2020.102207 sha: doc_id: 254942 cord_uid: g51mjj2b
file: cache/cord-321150-ev6acl7b.json key: cord-321150-ev6acl7b authors: Lam, Ha Minh; Ratmann, Oliver; Boni, Maciej F title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm date: 2017-10-03 journal: Mol Biol Evol DOI: 10.1093/molbev/msx263 sha: doc_id: 321150 cord_uid: ev6acl7b
file: cache/cord-302798-q0mbngqy.json key: cord-302798-q0mbngqy authors: Ge, Junwei; Gu, Shanshan; Cui, Xingyang; Zhao, Lili; Ma, Dexing; Shi, Yunjia; Wang, Yuanzhi; Lu, Taofeng; Chen, Hongyan title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China date: 2018-06-14 journal: Arch Virol DOI: 10.1007/s00705-018-3908-5 sha: doc_id: 302798 cord_uid: q0mbngqy
file: cache/cord-266794-oyppubq5.json key: cord-266794-oyppubq5 authors: Zhang, Dachuan; Zhang, Tong; Liu, Sheng; Sun, Dandan; Ding, Shaozhen; Cheng, Xingxiang; Cai, Pengli; Ren, Ailin; Han, Mengying; Liu, Dongliang; Jia, Cancan; Gong, Linlin; Zhang, Rui; Xing, Huadong; Tu, Weizhong; Chen, Junni; Hu, Qian-Nan title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model date: 2020-09-01 journal: Bioinformatics DOI: 10.1093/bioinformatics/btaa767 sha: doc_id: 266794 cord_uid: oyppubq5
file: cache/cord-300807-9u8idlon.json key: cord-300807-9u8idlon authors: Tong,
Joo Chuan; Ranganathan, Shoba title: 7 Infectious disease informatics date: 2013-12-31 journal: Computer-Aided Vaccine Design DOI: 10.1533/9781908818416.99 sha: doc_id: 300807 cord_uid: 9u8idlon
file: cache/cord-280881-5o38ihe0.json key: cord-280881-5o38ihe0 authors: Wlodawer, Alexander; Durell, Stewart R; Li, Mi; Oyama, Hiroshi; Oda, Kohei; Dunn, Ben M title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date: 2003-11-11 journal: BMC Struct Biol DOI: 10.1186/1472-6807-3-8 sha: doc_id: 280881 cord_uid: 5o38ihe0
file: cache/cord-274056-9t3kneoo.json key: cord-274056-9t3kneoo authors: Abd Elwahaab, Marwa A.; Abo-Elkhier, Mervat M.; Abo el Maaty, Moheb I. title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector date: 2019-05-08 journal: Biomed Res Int DOI: 10.1155/2019/8702968 sha: doc_id: 274056 cord_uid: 9t3kneoo
file: cache/cord-325985-xfzhn1n1.json key: cord-325985-xfzhn1n1 authors: Jabado, Omar J.; Liu, Yang; Conlan, Sean; Quan, P. Lan; Hegyi, Hédi; Lussier, Yves; Briese, Thomas; Palacios, Gustavo; Lipkin, W. I. title: Comprehensive viral oligonucleotide probe design using conserved protein regions date: 2007-12-13 journal: Nucleic Acids Res DOI: 10.1093/nar/gkm1106 sha: doc_id: 325985 cord_uid: xfzhn1n1
file: cache/cord-279528-41atidai.json key: cord-279528-41atidai authors: Abo-Elkhier, Mervat M.; Abd Elwahaab, Marwa A.; Abo El Maaty, Moheb I.
title: Measuring Similarity among Protein Sequences Using a New Descriptor date: 2019-11-22 journal: Biomed Res Int DOI: 10.1155/2019/2796971 sha: doc_id: 279528 cord_uid: 41atidai
file: cache/cord-301827-a7hnuxy5.json key: cord-301827-a7hnuxy5 authors: Uversky, Vladimir N title: A decade and a half of protein intrinsic disorder: Biology still waits for physics date: 2013-04-29 journal: Protein Science DOI: 10.1002/pro.2261 sha: doc_id: 301827 cord_uid: a7hnuxy5
file: cache/cord-300149-djclli8n.json key: cord-300149-djclli8n authors: Ruan, Yijun; Wei, Chia Lin; Ling, Ai Ee; Vega, Vinsensius B; Thoreau, Herve; Se Thoe, Su Yun; Chia, Jer-Ming; Ng, Patrick; Chiu, Kuo Ping; Lim, Landri; Zhang, Tao; Chan, Kwai Peng; Lin Ean, Lynette Oon; Ng, Mah Lee; Leo, Sin Yee; Ng, Lisa FP; Ren, Ee Chee; Stanton, Lawrence W; Long, Philip M; Liu, Edison T title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection date: 2003-05-24 journal: Lancet DOI: 10.1016/s0140-6736(03)13414-9 sha: doc_id: 300149 cord_uid: djclli8n
file: cache/cord-268467-btfz6ye8.json key: cord-268467-btfz6ye8 authors: Schreiber, Steven S.; Kamahora, Toshio; Lai, Michael M.C. title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E date: 1989-03-31 journal: Virology DOI: 10.1016/0042-6822(89)90050-0 sha: doc_id: 268467 cord_uid: btfz6ye8
file: cache/cord-287658-c2lljdi7.json key: cord-287658-c2lljdi7 authors: Lopez-Rincon, Alejandro; Tonda, Alberto; Mendoza-Maldonado, Lucero; Mulders, Daphne G.J.C.; Molenkamp, Richard; Perez-Romero, Carmina A.; Claassen, Eric; Garssen, Johan; Kraneveld, Aletta D.
title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning date: 2020-09-10 journal: bioRxiv DOI: 10.1101/2020.03.13.990242 sha: doc_id: 287658 cord_uid: c2lljdi7
file: cache/cord-304869-l6a68tqn.json key: cord-304869-l6a68tqn authors: Bielińska-Wąż, Dorota title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity date: 2011-08-28 journal: J Math Chem DOI: 10.1007/s10910-011-9890-8 sha: doc_id: 304869 cord_uid: l6a68tqn
file: cache/cord-287634-64zqe4cz.json key: cord-287634-64zqe4cz authors: Al-Ssulami, Abdulrakeeb M.; Azmi, Aqil M.; Hussain, Muhammad title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents date: 2020-01-31 journal: Genomics DOI: 10.1016/j.ygeno.2019.02.002 sha: doc_id: 287634 cord_uid: 64zqe4cz
file: cache/cord-324216-ce3wa889.json key: cord-324216-ce3wa889 authors: Wang, Zheng; Malanoski, Anthony P; Lin, Baochuan; Kidd, Carolyn; Long, Nina C; Blaney, Kate M; Thach, Dzung C; Tibbetts, Clark; Stenger, David A title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date: 2008-12-01 journal: BMC Genomics DOI: 10.1186/1471-2164-9-577 sha: doc_id: 324216 cord_uid: ce3wa889
file: cache/cord-296691-cg463fbn.json key: cord-296691-cg463fbn authors: Wang, Ren; Xu, Sheng; Jiang, Yumei; Jiang, Jingwei; Li, Xiaodan; Liang, Lijian; He, Jia; Peng, Feng; Xia, Bing title: De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing date: 2013-04-09 journal: PLoS One DOI: 10.1371/journal.pone.0060449 sha: doc_id: 296691 cord_uid: cg463fbn
file: cache/cord-302161-ytr7ds8i.json key: cord-302161-ytr7ds8i authors: Lutz, Mirjam; Steiner, Aline R.; Cattori, Valentino; Hofmann-Lehmann, Regina; Lutz, Hans; Kipar, Anja; Meli, Marina L.
title: FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP date: 2020-07-24 journal: Pathogens DOI: 10.3390/pathogens9080603 sha: doc_id: 302161 cord_uid: ytr7ds8i file: cache/cord-291156-zxg3dsm3.json key: cord-291156-zxg3dsm3 authors: Bernasconi, Anna; Canakoglu, Arif; Pinoli, Pietro; Ceri, Stefano title: Empowering Virus Sequences Research through Conceptual Modeling date: 2020-05-01 journal: bioRxiv DOI: 10.1101/2020.04.29.067637 sha: doc_id: 291156 cord_uid: zxg3dsm3 file: cache/cord-304607-td0776wj.json key: cord-304607-td0776wj authors: Paszkiewicz, Konrad H.; Giezen, Mark van der title: Omics, Bioinformatics, and Infectious Disease Research date: 2010-12-24 journal: Genetics and Evolution of Infectious Disease DOI: 10.1016/b978-0-12-384890-1.00018-2 sha: doc_id: 304607 cord_uid: td0776wj file: cache/cord-310734-6v7oru2l.json key: cord-310734-6v7oru2l authors: Bolatti, Elisa M.; Zorec, Tomaž M.; Montani, María E.; Hošnjak, Lea; Chouhy, Diego; Viarengo, Gastón; Casal, Pablo E.; Barquez, Rubén M.; Poljak, Mario; Giri, Adriana A. title: A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses date: 2020-04-09 journal: Viruses DOI: 10.3390/v12040422 sha: doc_id: 310734 cord_uid: 6v7oru2l file: cache/cord-023209-un2ysc2v.json key: cord-023209-un2ysc2v authors: nan title: Poster Presentations date: 2008-10-07 journal: J Pept Sci DOI: 10.1002/psc.1090 sha: doc_id: 23209 cord_uid: un2ysc2v file: cache/cord-325043-vqjhiv7p.json key: cord-325043-vqjhiv7p authors: Gorbalenya, Alexander E.; Blinov, Vladimir M.; Donchenko, Alexei P.; Koonin, Eugene V. 
title: An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication date: 1989 journal: J Mol Evol DOI: 10.1007/bf02102483 sha: doc_id: 325043 cord_uid: vqjhiv7p file: cache/cord-004879-pgyzluwp.json key: cord-004879-pgyzluwp authors: nan title: Programmed cell death date: 1994 journal: Experientia DOI: 10.1007/bf02033112 sha: doc_id: 4879 cord_uid: pgyzluwp file: cache/cord-325750-x7jpsnxg.json key: cord-325750-x7jpsnxg authors: Mokili, John L; Rohwer, Forest; Dutilh, Bas E title: Metagenomics and future perspectives in virus discovery date: 2012-01-20 journal: Curr Opin Virol DOI: 10.1016/j.coviro.2011.12.004 sha: doc_id: 325750 cord_uid: x7jpsnxg file: cache/cord-324021-y1vr1db0.json key: cord-324021-y1vr1db0 authors: Kozak, M. title: Determinants of translational fidelity and efficiency in vertebrate mRNAs date: 1994-12-31 journal: Biochimie DOI: 10.1016/0300-9084(94)90182-1 sha: doc_id: 324021 cord_uid: y1vr1db0 file: cache/cord-001835-0s7ok4uw.json key: cord-001835-0s7ok4uw authors: nan title: Abstracts of the 29th Annual Symposium of The Protein Society date: 2015-10-01 journal: Protein Science DOI: 10.1002/pro.2823 sha: doc_id: 1835 cord_uid: 0s7ok4uw file: cache/cord-326225-crtpzad7.json key: cord-326225-crtpzad7 authors: Neill, John D.; Bayles, Darrell O.; Ridpath, Julia F. 
title: Simultaneous rapid sequencing of multiple RNA virus genomes date: 2014-06-01 journal: J Virol Methods DOI: 10.1016/j.jviromet.2014.02.016 sha: doc_id: 326225 cord_uid: crtpzad7 file: cache/cord-328644-odtue60a.json key: cord-328644-odtue60a authors: Comandatore, Francesco; Chiodi, Alice; Gabrieli, Paolo; Biffignandi, Gherard Batisti; Perini, Matteo; Ricagno, Stefano; Mascolo, Elia; Petazzoni, Greta; Ramazzotti, Matteo; Rimoldi, Sara Giordana; Gismondo, Maria Rita; Micheli, Valeria; Sassera, Davide; Gaiarsa, Stefano; Bandi, Claudio; Brilli, Matteo title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes date: 2020-05-28 journal: bioRxiv DOI: 10.1101/2020.04.30.071027 sha: doc_id: 328644 cord_uid: odtue60a file: cache/cord-334394-qgyzk7th.json key: cord-334394-qgyzk7th authors: Edgar, Robert C.; Taylor, Jeff; Altman, Tomer; Barbera, Pierre; Meleshko, Dmitry; Lin, Victor; Lohr, Dan; Novakovsky, Gherman; Al-Shayeb, Basem; Banfield, Jillian F.; Korobeynikov, Anton; Chikhi, Rayan; Babaian, Artem title: Petabase-scale sequence alignment catalyses viral discovery date: 2020-08-10 journal: bioRxiv DOI: 10.1101/2020.08.07.241729 sha: doc_id: 334394 cord_uid: qgyzk7th file: cache/cord-331698-rwow1ydx.json key: cord-331698-rwow1ydx authors: Latorre-Pérez, Adriel; Pascual, Javier; Porcar, Manuel; Vilanova, Cristina title: A lab in the field: applications of real-time, in situ metagenomic sequencing date: 2020-08-20 journal: Biol Methods Protoc DOI: 10.1093/biomethods/bpaa016 sha: doc_id: 331698 cord_uid: rwow1ydx file: cache/cord-330067-ujhgb3b0.json key: cord-330067-ujhgb3b0 authors: Huang, Yi; Lau, Susanna K. P.; Woo, Patrick C. 
Y.; Yuen, Kwok-yung title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes date: 2007-10-02 journal: Nucleic Acids Res DOI: 10.1093/nar/gkm754 sha: doc_id: 330067 cord_uid: ujhgb3b0 file: cache/cord-338207-60vrlrim.json key: cord-338207-60vrlrim authors: Lefkowitz, E.J.; Odom, M.R.; Upton, C. title: Virus Databases date: 2008-07-30 journal: Encyclopedia of Virology DOI: 10.1016/b978-012374410-4.00719-6 sha: doc_id: 338207 cord_uid: 60vrlrim file: cache/cord-339209-oe8onyr9.json key: cord-339209-oe8onyr9 authors: Vasilakis, Nikos; Guzman, Hilda; Firth, Cadhla; Forrester, Naomi L; Widen, Steven G; Wood, Thomas G; Rossi, Shannan L; Ghedin, Elodie; Popov, Vsevolov; Blasdell, Kim R; Walker, Peter J; Tesh, Robert B title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date: 2014-05-20 journal: Virol J DOI: 10.1186/1743-422x-11-97 sha: doc_id: 339209 cord_uid: oe8onyr9 file: cache/cord-334127-wjf8t8vp.json key: cord-334127-wjf8t8vp authors: Brister, J. Rodney; Ako-adjei, Danso; Bao, Yiming; Blinkova, Olga title: NCBI Viral Genomes Resource date: 2015-01-28 journal: Nucleic Acids Res DOI: 10.1093/nar/gku1207 sha: doc_id: 334127 cord_uid: wjf8t8vp file: cache/cord-348427-worgd0xu.json key: cord-348427-worgd0xu authors: Hatcher, Eneida L.; Zhdanov, Sergey A.; Bao, Yiming; Blinkova, Olga; Nawrocki, Eric P.; Ostapchuck, Yuri; Schäffer, Alejandro A.; Brister, J. 
Rodney title: Virus Variation Resource – improved response to emergent viral outbreaks date: 2017-01-04 journal: Nucleic Acids Res DOI: 10.1093/nar/gkw1065 sha: doc_id: 348427 cord_uid: worgd0xu file: cache/cord-340907-j9i1wlak.json key: cord-340907-j9i1wlak authors: Zarai, Yoram; Zafrir, Zohar; Siridechadilok, Bunpote; Suphatrakul, Amporn; Roopin, Modi; Julander, Justin; Tuller, Tamir title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts date: 2020-04-27 journal: DNA Res DOI: 10.1093/dnares/dsaa008 sha: doc_id: 340907 cord_uid: j9i1wlak file: cache/cord-341564-fvuwick5.json key: cord-341564-fvuwick5 authors: Qi, Zhao-Hui; Li, Ke-Cheng; Ma, Jin-Long; Yao, Yu-Hua; Liu, Ling-Yun title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application date: 2018-06-12 journal: Evol Bioinform Online DOI: 10.1177/1176934318777755 sha: doc_id: 341564 cord_uid: fvuwick5 file: cache/cord-345552-h6fwi0qn.json key: cord-345552-h6fwi0qn authors: Li, Q.-G.; Lindman, K.; Wadell, G. title: Hydropathic characteristics of adenovirus hexons date: 1997-07-01 journal: Arch Virol DOI: 10.1007/s007050050162 sha: doc_id: 345552 cord_uid: h6fwi0qn file: cache/cord-328259-3g4klpyg.json key: cord-328259-3g4klpyg authors: Guajardo-Leiva, Sergio; Chnaiderman, Jonás; Gaggero, Aldo; Díez, Beatriz title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City date: 2020-09-21 journal: Viruses DOI: 10.3390/v12091050 sha: doc_id: 328259 cord_uid: 3g4klpyg file: cache/cord-330312-1pjolkql.json key: cord-330312-1pjolkql authors: Liu, Y.-T. title: Infectious Disease Genomics date: 2017-01-20 journal: Genetics and Evolution of Infectious Diseases DOI: 10.1016/b978-0-12-799942-5.00010-x sha: doc_id: 330312 cord_uid: 1pjolkql file: cache/cord-354465-5nqrrnqr.json key: cord-354465-5nqrrnqr authors: Haslinger, Christian; Stadler, Peter F. 
title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties date: 1999 journal: Bull Math Biol DOI: 10.1006/bulm.1998.0085 sha: doc_id: 354465 cord_uid: 5nqrrnqr file: cache/cord-342785-55r01n0x.json key: cord-342785-55r01n0x authors: Lemmon, Gordon H; Gardner, Shea N title: Predicting the sensitivity and specificity of published real-time PCR assays date: 2008-09-25 journal: Ann Clin Microbiol Antimicrob DOI: 10.1186/1476-0711-7-18 sha: doc_id: 342785 cord_uid: 55r01n0x file: cache/cord-344782-ond1ziu5.json key: cord-344782-ond1ziu5 authors: Zhang, Jing; Finlaison, Deborah S.; Frost, Melinda J.; Gestier, Sarah; Gu, Xingnian; Hall, Jane; Jenkins, Cheryl; Parrish, Kate; Read, Andrew J.; Srivastava, Mukesh; Rose, Karrie; Kirkland, Peter D. title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi) date: 2018-10-24 journal: PLoS One DOI: 10.1371/journal.pone.0205209 sha: doc_id: 344782 cord_uid: ond1ziu5 file: cache/cord-339915-8j04y50s.json key: cord-339915-8j04y50s authors: Deng, Wei; Luan, Yihui title: DV-Curve Representation of Protein Sequences and Its Application date: 2014-05-08 journal: Comput Math Methods Med DOI: 10.1155/2014/203871 sha: doc_id: 339915 cord_uid: 8j04y50s file: cache/cord-355075-ieb35upi.json key: cord-355075-ieb35upi authors: Papenfuss, Anthony T; Baker, Michelle L; Feng, Zhi-Ping; Tachedjian, Mary; Crameri, Gary; Cowled, Chris; Ng, Justin; Janardhana, Vijaya; Field, Hume E; Wang, Lin-Fa title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox date: 2012-06-20 journal: BMC Genomics DOI: 10.1186/1471-2164-13-261 sha: doc_id: 355075 cord_uid: ieb35upi file: cache/cord-353290-1wi1dhv6.json key: cord-353290-1wi1dhv6 authors: Kustin, Talia; Stern, Adi title: Biased mutation and selection in RNA viruses date: 2020-09-28 journal: Mol Biol Evol DOI: 
10.1093/molbev/msaa247 sha: doc_id: 353290 cord_uid: 1wi1dhv6 file: cache/cord-343863-q1y8uscj.json key: cord-343863-q1y8uscj authors: Whitney, Joe; Esteban, David J; Upton, Chris title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches date: 2005-02-08 journal: BMC Bioinformatics DOI: 10.1186/1471-2105-6-23 sha: doc_id: 343863 cord_uid: q1y8uscj file: cache/cord-341879-vubszdp2.json key: cord-341879-vubszdp2 authors: Li, Lucy M; Grassly, Nicholas C; Fraser, Christophe title: Genomic analysis of emerging pathogens: methods, application and future trends date: 2014-11-22 journal: Genome Biol DOI: 10.1186/s13059-014-0541-9 sha: doc_id: 341879 cord_uid: vubszdp2
Reading metadata file and updating bibliogrpahics
=== updating bibliographic database
Building study carrel named keyword-sequence-cord
=== file2bib.sh ===
Traceback (most recent call last):
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cord-193356-hqbstgg7'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data-disk/reader-compute/reader-cord/bin/file2bib.py", line 64, in
    if ( bibliographics.loc[ escape ,'author'] ) : author = bibliographics.loc[ escape,'author']
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1272, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1389, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1965, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 625, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cord-193356-hqbstgg7'
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 3304 Aborted $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 2864 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 4166 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 4256 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 3003 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 3868 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 5108 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. 
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 3945 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 96783 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 4500 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 1696 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 4970 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. /data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 4489 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint Try decreasing the value of OMP_NUM_THREADS. 
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 96250 Aborted $FILE2BIB "$FILE" > "$OUTPUT" === file2bib.sh === id: cord-014674-ey29970v author: nan title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 date: 2003 pages: extension: .txt txt: ./txt/cord-014674-ey29970v.txt cache: ./cache/cord-014674-ey29970v.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-014674-ey29970v.txt' === file2bib.sh === id: cord-253436-dz84icdc author: Wille, Michelle title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl date: 2016-03-03 pages: extension: .txt txt: ./txt/cord-253436-dz84icdc.txt cache: ./cache/cord-253436-dz84icdc.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-253436-dz84icdc.txt' === file2bib.sh === id: cord-018459-isbc1r2o author: Munjal, Geetika title: Phylogenetics Algorithms and Applications date: 2018-12-10 pages: extension: .txt txt: ./txt/cord-018459-isbc1r2o.txt cache: ./cache/cord-018459-isbc1r2o.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-018459-isbc1r2o.txt' === file2bib.sh === id: cord-001786-ybd8hi8y author: Dutilh, Bas E title: Metagenomic 
ventures into outer sequence space date: 2014-12-15 pages: extension: .txt txt: ./txt/cord-001786-ybd8hi8y.txt cache: ./cache/cord-001786-ybd8hi8y.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-001786-ybd8hi8y.txt' === file2bib.sh === id: cord-012975-u87ol3fs author: Ogiwara, Atsushi title: Construction of a dictionary of sequence motifs that characterize groups of related proteins date: 1992-09-17 pages: extension: .txt txt: ./txt/cord-012975-u87ol3fs.txt cache: ./cache/cord-012975-u87ol3fs.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-012975-u87ol3fs.txt' === file2bib.sh === id: cord-001340-kqcx7lrq author: Ladner, Jason T. 
title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date: 2014-06-17 pages: extension: .txt txt: ./txt/cord-001340-kqcx7lrq.txt cache: ./cache/cord-001340-kqcx7lrq.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-001340-kqcx7lrq.txt' === file2bib.sh === id: cord-255194-4i9fc0r7 author: Djikeng, Appolinaire title: Viral genome sequencing by random priming methods date: 2008-01-07 pages: extension: .txt txt: ./txt/cord-255194-4i9fc0r7.txt cache: ./cache/cord-255194-4i9fc0r7.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-255194-4i9fc0r7.txt' === file2bib.sh === id: cord-264135-s2u76pvk author: Patel, Amrutlal K. 
title: Complete genome sequence analysis of chicken astrovirus isolate from India date: 2016-12-23 pages: extension: .txt txt: ./txt/cord-264135-s2u76pvk.txt cache: ./cache/cord-264135-s2u76pvk.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-264135-s2u76pvk.txt' === file2bib.sh === id: cord-027316-echxuw74 author: Modarresi, Kourosh title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date: 2020-05-22 pages: extension: .txt txt: ./txt/cord-027316-echxuw74.txt cache: ./cache/cord-027316-echxuw74.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-027316-echxuw74.txt' === file2bib.sh === id: cord-256278-jvfjf7aw author: Feng, Jie title: New method for comparing DNA primary sequences based on a discrimination measure date: 2010-10-21 pages: extension: .txt txt: ./txt/cord-256278-jvfjf7aw.txt cache: ./cache/cord-256278-jvfjf7aw.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-256278-jvfjf7aw.txt' === file2bib.sh === id: cord-005060-n901y2d4 author: ZHANG, Feiyun title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus date: 2001 pages: extension: .txt txt: ./txt/cord-005060-n901y2d4.txt cache: ./cache/cord-005060-n901y2d4.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By 
['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-005060-n901y2d4.txt' === file2bib.sh === id: cord-010161-bcuec2fz author: Matson, David O. title: IV, 6. Calicivirus RNA recombination date: 2004-09-14 pages: extension: .txt txt: ./txt/cord-010161-bcuec2fz.txt cache: ./cache/cord-010161-bcuec2fz.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-010161-bcuec2fz.txt' === file2bib.sh === id: cord-025610-7vouj8pp author: Latif, Seemab title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints date: 2020-05-06 pages: extension: .txt txt: ./txt/cord-025610-7vouj8pp.txt cache: ./cache/cord-025610-7vouj8pp.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-025610-7vouj8pp.txt' === file2bib.sh === id: cord-001537-i34vmfpp author: Lima, Francisco Esmaile de Sales title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil date: 2015-02-17 pages: extension: .txt txt: ./txt/cord-001537-i34vmfpp.txt cache: ./cache/cord-001537-i34vmfpp.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-001537-i34vmfpp.txt' === file2bib.sh === id: cord-023647-dlqs8ay9 author: nan title: Sequences 
and topology date: 2003-03-21 pages: extension: .txt txt: ./txt/cord-023647-dlqs8ay9.txt cache: ./cache/cord-023647-dlqs8ay9.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-023647-dlqs8ay9.txt' === file2bib.sh === id: cord-268549-2lg8i9r1 author: Dai, Qi title: Sequence comparison via polar coordinates representation and curve tree date: 2012-01-07 pages: extension: .txt txt: ./txt/cord-268549-2lg8i9r1.txt cache: ./cache/cord-268549-2lg8i9r1.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-268549-2lg8i9r1.txt' === file2bib.sh === id: cord-306725-0vam15pt author: Li, Hao title: First detection and genomic characteristics of bovine torovirus in dairy calves in China date: 2020-05-09 pages: extension: .txt txt: ./txt/cord-306725-0vam15pt.txt cache: ./cache/cord-306725-0vam15pt.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-306725-0vam15pt.txt' === file2bib.sh === id: cord-266794-oyppubq5 author: Zhang, Dachuan title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model date: 2020-09-01 pages: extension: .txt txt: ./txt/cord-266794-oyppubq5.txt cache: ./cache/cord-266794-oyppubq5.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 
'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-266794-oyppubq5.txt' === file2bib.sh === id: cord-000473-jpow6iw1 author: Astrovskaya, Irina title: Inferring viral quasispecies spectra from 454 pyrosequencing reads date: 2011-07-28 pages: extension: .txt txt: ./txt/cord-000473-jpow6iw1.txt cache: ./cache/cord-000473-jpow6iw1.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-000473-jpow6iw1.txt' === file2bib.sh === id: cord-000257-ampip7od author: Bagowski, Christoph P title: The Nature of Protein Domain Evolution: Shaping the Interaction Network date: 2010-08-17 pages: extension: .txt txt: ./txt/cord-000257-ampip7od.txt cache: ./cache/cord-000257-ampip7od.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-000257-ampip7od.txt' === file2bib.sh === id: cord-321150-ev6acl7b author: Lam, Ha Minh title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm date: 2017-10-03 pages: extension: .txt txt: ./txt/cord-321150-ev6acl7b.txt cache: ./cache/cord-321150-ev6acl7b.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-321150-ev6acl7b.txt' === file2bib.sh === id: cord-265857-fs6dj3dp author: Liu, Yu-Tsueng title: Infectious Disease Genomics date: 
2010-12-24 pages: extension: .txt txt: ./txt/cord-265857-fs6dj3dp.txt cache: ./cache/cord-265857-fs6dj3dp.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-265857-fs6dj3dp.txt' === file2bib.sh === id: cord-017584-9rx4jlw8 author: Kim, Kwangsoo title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data date: 2007 pages: extension: .txt txt: ./txt/cord-017584-9rx4jlw8.txt cache: ./cache/cord-017584-9rx4jlw8.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-017584-9rx4jlw8.txt' === file2bib.sh === id: cord-256608-ajzk86rq author: van Weezep, Erik title: PCR diagnostics: In silico validation by an automated tool using freely available software programs date: 2019-05-13 pages: extension: .txt txt: ./txt/cord-256608-ajzk86rq.txt cache: ./cache/cord-256608-ajzk86rq.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-256608-ajzk86rq.txt' === file2bib.sh === id: cord-203232-1nnqx1g9 author: Canturk, Semih title: Machine-Learning Driven Drug Repurposing for COVID-19 date: 2020-06-25 pages: extension: .txt txt: ./txt/cord-203232-1nnqx1g9.txt cache: ./cache/cord-203232-1nnqx1g9.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler 
ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-203232-1nnqx1g9.txt' === file2bib.sh === id: cord-266288-buc4dd5y author: Dong, Rui title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance date: 2019-04-09 pages: extension: .txt txt: ./txt/cord-266288-buc4dd5y.txt cache: ./cache/cord-266288-buc4dd5y.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-266288-buc4dd5y.txt' === file2bib.sh === id: cord-004862-yv76yvy5 author: Demers, G. William title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date: 1989 pages: extension: .txt txt: ./txt/cord-004862-yv76yvy5.txt cache: ./cache/cord-004862-yv76yvy5.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-004862-yv76yvy5.txt' === file2bib.sh === id: cord-255371-o9oxchq6 author: Nguyen, Thanh Thi title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) date: 2020-07-10 pages: extension: .txt txt: ./txt/cord-255371-o9oxchq6.txt cache: ./cache/cord-255371-o9oxchq6.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-255371-o9oxchq6.txt' === file2bib.sh === id: cord-266960-kyx6xhvj author: Temple, Mark D. 
title: Real-time audio and visual display of the Coronavirus genome date: 2020-10-02 pages: extension: .txt txt: ./txt/cord-266960-kyx6xhvj.txt cache: ./cache/cord-266960-kyx6xhvj.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-266960-kyx6xhvj.txt' === file2bib.sh === id: cord-002473-2kpxhzbe author: Das, Jayanta Kumar title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach date: 2017-03-31 pages: extension: .txt txt: ./txt/cord-002473-2kpxhzbe.txt cache: ./cache/cord-002473-2kpxhzbe.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-002473-2kpxhzbe.txt' === file2bib.sh === id: cord-311240-o0zyt2vb author: Motayo, Babatunde Olarenwaju title: Evolution and Genetic Diversity of SARS-CoV-2 in Africa Using Whole Genome Sequences date: 2020-07-27 pages: extension: .txt txt: ./txt/cord-311240-o0zyt2vb.txt cache: ./cache/cord-311240-o0zyt2vb.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-311240-o0zyt2vb.txt' === file2bib.sh === id: cord-213136-euv6pqh5 author: Singh, Kulveer title: Sequence Effects on Internal Structure of Droplets of Associative Polymers date: 2020-05-17 pages: extension: .txt txt: ./txt/cord-213136-euv6pqh5.txt cache: ./cache/cord-213136-euv6pqh5.txt Content-Encoding UTF-8 Content-Type text/plain; 
charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-213136-euv6pqh5.txt' === file2bib.sh === id: cord-014461-2ubh9u8r author: Nelson, Oranmiyan W. title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date: 2012-10-10 pages: extension: .txt txt: ./txt/cord-014461-2ubh9u8r.txt cache: ./cache/cord-014461-2ubh9u8r.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-014461-2ubh9u8r.txt' === file2bib.sh === id: cord-264296-0x90yubt author: Sawmya, Shashata title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond date: 2020-06-03 pages: extension: .txt txt: ./txt/cord-264296-0x90yubt.txt cache: ./cache/cord-264296-0x90yubt.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-264296-0x90yubt.txt' === file2bib.sh === id: cord-010499-yefxrj30 author: Yelverton, Elizabeth title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli date: 2006-10-27 pages: extension: .txt txt: ./txt/cord-010499-yefxrj30.txt cache: ./cache/cord-010499-yefxrj30.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-010499-yefxrj30.txt' 
=== file2bib.sh === id: cord-287634-64zqe4cz author: Al-Ssulami, Abdulrakeeb M. title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents date: 2020-01-31 pages: extension: .txt txt: ./txt/cord-287634-64zqe4cz.txt cache: ./cache/cord-287634-64zqe4cz.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-287634-64zqe4cz.txt' === file2bib.sh === id: cord-279528-41atidai author: Abo-Elkhier, Mervat M. title: Measuring Similarity among Protein Sequences Using a New Descriptor date: 2019-11-22 pages: extension: .txt txt: ./txt/cord-279528-41atidai.txt cache: ./cache/cord-279528-41atidai.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-279528-41atidai.txt' === file2bib.sh === id: cord-102766-n6mpdhyu author: Alam, Md. 
Nafis Ul title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date: 2020-06-25 pages: extension: .txt txt: ./txt/cord-102766-n6mpdhyu.txt cache: ./cache/cord-102766-n6mpdhyu.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-102766-n6mpdhyu.txt' === file2bib.sh === id: cord-280881-5o38ihe0 author: Wlodawer, Alexander title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date: 2003-11-11 pages: extension: .txt txt: ./txt/cord-280881-5o38ihe0.txt cache: ./cache/cord-280881-5o38ihe0.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-280881-5o38ihe0.txt' === file2bib.sh === id: cord-003316-r5te5xob author: Balloux, Francois title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic date: 2018-12-17 pages: extension: .txt txt: ./txt/cord-003316-r5te5xob.txt cache: ./cache/cord-003316-r5te5xob.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-003316-r5te5xob.txt' === file2bib.sh === id: cord-000642-mkwpuav6 author: Moreira, Rebeca title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing date: 2012-04-19 pages: extension: .txt 
txt: ./txt/cord-000642-mkwpuav6.txt cache: ./cache/cord-000642-mkwpuav6.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-000642-mkwpuav6.txt' === file2bib.sh === id: cord-287658-c2lljdi7 author: Lopez-Rincon, Alejandro title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning date: 2020-09-10 pages: extension: .txt txt: ./txt/cord-287658-c2lljdi7.txt cache: ./cache/cord-287658-c2lljdi7.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-287658-c2lljdi7.txt' === file2bib.sh === id: cord-302798-q0mbngqy author: Ge, Junwei title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China date: 2018-06-14 pages: extension: .txt txt: ./txt/cord-302798-q0mbngqy.txt cache: ./cache/cord-302798-q0mbngqy.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-302798-q0mbngqy.txt' === file2bib.sh === id: cord-321386-u1imic5l author: Li, Chun title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation date: 2018-02-17 pages: extension: .txt txt: ./txt/cord-321386-u1imic5l.txt cache: ./cache/cord-321386-u1imic5l.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 
'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-321386-u1imic5l.txt' === file2bib.sh === id: cord-274056-9t3kneoo author: Abd Elwahaab, Marwa A. title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector date: 2019-05-08 pages: extension: .txt txt: ./txt/cord-274056-9t3kneoo.txt cache: ./cache/cord-274056-9t3kneoo.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-274056-9t3kneoo.txt' === file2bib.sh === id: cord-193910-7p3f3znj author: Zhang, Xiangxie title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification date: 2020-11-01 pages: extension: .txt txt: ./txt/cord-193910-7p3f3znj.txt cache: ./cache/cord-193910-7p3f3znj.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-193910-7p3f3znj.txt' === file2bib.sh === id: cord-016798-tv2ntug6 author: Gautam, Ablesh title: Bioinformatics Applications in Advancing Animal Virus Research date: 2019-06-06 pages: extension: .txt txt: ./txt/cord-016798-tv2ntug6.txt cache: ./cache/cord-016798-tv2ntug6.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-016798-tv2ntug6.txt' === file2bib.sh === id: cord-252347-vnn4135b author: Lee, Wai-Ming 
title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants date: 2007-10-03 pages: extension: .txt txt: ./txt/cord-252347-vnn4135b.txt cache: ./cache/cord-252347-vnn4135b.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-252347-vnn4135b.txt' === file2bib.sh === id: cord-001974-wjf3c7a7 author: Friis-Nielsen, Jens title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date: 2016-02-19 pages: extension: .txt txt: ./txt/cord-001974-wjf3c7a7.txt cache: ./cache/cord-001974-wjf3c7a7.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-001974-wjf3c7a7.txt' === file2bib.sh === id: cord-025948-6dsx7pey author: Maitra, Arindam title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility date: 2020-06-04 pages: extension: .txt txt: ./txt/cord-025948-6dsx7pey.txt cache: ./cache/cord-025948-6dsx7pey.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-025948-6dsx7pey.txt' === file2bib.sh === id: cord-268467-btfz6ye8 author: Schreiber, Steven S. 
title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E date: 1989-03-31 pages: extension: .txt txt: ./txt/cord-268467-btfz6ye8.txt cache: ./cache/cord-268467-btfz6ye8.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-268467-btfz6ye8.txt' === file2bib.sh === id: cord-300149-djclli8n author: Ruan, Yijun title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection date: 2003-05-24 pages: extension: .txt txt: ./txt/cord-300149-djclli8n.txt cache: ./cache/cord-300149-djclli8n.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-300149-djclli8n.txt' === file2bib.sh === id: cord-267500-x3u9i1vq author: Rose, Rebecca title: Challenges in the analysis of viral metagenomes date: 2016-08-03 pages: extension: .txt txt: ./txt/cord-267500-x3u9i1vq.txt cache: ./cache/cord-267500-x3u9i1vq.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-267500-x3u9i1vq.txt' === file2bib.sh === id: cord-035033-osjy88rc author: Aydin, Berkay title: Spatiotemporal event sequence discovery without thresholds date: 2020-11-09 pages: extension: .txt txt: ./txt/cord-035033-osjy88rc.txt cache: ./cache/cord-035033-osjy88rc.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-035033-osjy88rc.txt' === file2bib.sh === id: cord-324216-ce3wa889 author: Wang, Zheng title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date: 2008-12-01 pages: extension: .txt txt: ./txt/cord-324216-ce3wa889.txt cache: ./cache/cord-324216-ce3wa889.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-324216-ce3wa889.txt' === file2bib.sh === id: cord-033010-o5kiadfm author: Durojaye, Olanrewaju Ayodeji title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date: 2020-10-02 pages: extension: .txt txt: ./txt/cord-033010-o5kiadfm.txt cache: ./cache/cord-033010-o5kiadfm.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-033010-o5kiadfm.txt' === file2bib.sh === id: cord-300796-rmjv56ia author: nan title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation date: 1990-09-01 pages: extension: .txt txt: ./txt/cord-300796-rmjv56ia.txt cache: ./cache/cord-300796-rmjv56ia.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler 
X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-300796-rmjv56ia.txt' === file2bib.sh === id: cord-015850-ef6svn8f author: Saitou, Naruya title: Eukaryote Genomes date: 2013-08-22 pages: extension: .txt txt: ./txt/cord-015850-ef6svn8f.txt cache: ./cache/cord-015850-ef6svn8f.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-015850-ef6svn8f.txt' === file2bib.sh === id: cord-325985-xfzhn1n1 author: Jabado, Omar J. title: Comprehensive viral oligonucleotide probe design using conserved protein regions date: 2007-12-13 pages: extension: .txt txt: ./txt/cord-325985-xfzhn1n1.txt cache: ./cache/cord-325985-xfzhn1n1.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-325985-xfzhn1n1.txt' === file2bib.sh === id: cord-275258-azpg5yrh author: Mead, Dylan J.T. title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date: 2019-07-26 pages: extension: .txt txt: ./txt/cord-275258-azpg5yrh.txt cache: ./cache/cord-275258-azpg5yrh.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-275258-azpg5yrh.txt' === file2bib.sh === id: cord-263987-ff6kor0c author: Holmes, Ian H. 
title: Solving the master equation for Indels date: 2017-05-12 pages: extension: .txt txt: ./txt/cord-263987-ff6kor0c.txt cache: ./cache/cord-263987-ff6kor0c.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-263987-ff6kor0c.txt' === file2bib.sh === id: cord-022494-d66rz6dc author: Webb, B. title: Comparative Modeling of Drug Target Proteins date: 2014-10-01 pages: extension: .txt txt: ./txt/cord-022494-d66rz6dc.txt cache: ./cache/cord-022494-d66rz6dc.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-022494-d66rz6dc.txt' === file2bib.sh === id: cord-010273-0c56x9f5 author: Simmonds, Peter title: Virology of hepatitis C virus date: 2001-10-10 pages: extension: .txt txt: ./txt/cord-010273-0c56x9f5.txt cache: ./cache/cord-010273-0c56x9f5.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-010273-0c56x9f5.txt' === file2bib.sh === id: cord-103029-nc5yf6x4 author: Wichmann, Stefan title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank date: 2020-09-25 pages: extension: .txt txt: ./txt/cord-103029-nc5yf6x4.txt cache: ./cache/cord-103029-nc5yf6x4.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] 
X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-103029-nc5yf6x4.txt' === file2bib.sh === id: cord-017932-vmtjc8ct author: Georgiev, Vassil St. title: Genomic and Postgenomic Research date: 2009 pages: extension: .txt txt: ./txt/cord-017932-vmtjc8ct.txt cache: ./cache/cord-017932-vmtjc8ct.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-017932-vmtjc8ct.txt' === file2bib.sh === id: cord-018133-2otxft31 author: Altman, Russ B. title: Bioinformatics date: 2006 pages: extension: .txt txt: ./txt/cord-018133-2otxft31.txt cache: ./cache/cord-018133-2otxft31.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-018133-2otxft31.txt' === file2bib.sh === id: cord-321715-bkfkmtld author: Redelings, Benjamin D title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens date: 2007-03-14 pages: extension: .txt txt: ./txt/cord-321715-bkfkmtld.txt cache: ./cache/cord-321715-bkfkmtld.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-321715-bkfkmtld.txt' === file2bib.sh === id: cord-311839-61djk4bs author: Wei, Dan title: A novel hierarchical clustering algorithm for gene sequences date: 2012-07-23 pages: extension: .txt txt: ./txt/cord-311839-61djk4bs.txt cache: 
./cache/cord-311839-61djk4bs.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-311839-61djk4bs.txt' === file2bib.sh === id: cord-326225-crtpzad7 author: Neill, John D. title: Simultaneous rapid sequencing of multiple RNA virus genomes date: 2014-06-01 pages: extension: .txt txt: ./txt/cord-326225-crtpzad7.txt cache: ./cache/cord-326225-crtpzad7.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-326225-crtpzad7.txt' === file2bib.sh === id: cord-345552-h6fwi0qn author: Li, Q.-G. title: Hydropathic characteristics of adenovirus hexons date: 1997-07-01 pages: extension: .txt txt: ./txt/cord-345552-h6fwi0qn.txt cache: ./cache/cord-345552-h6fwi0qn.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-345552-h6fwi0qn.txt' === file2bib.sh === /data-disk/reader-compute/reader-cord/bin/file2bib.sh: fork: retry: No child processes id: cord-330067-ujhgb3b0 author: Huang, Yi title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes date: 2007-10-02 pages: extension: .txt txt: ./txt/cord-330067-ujhgb3b0.txt cache: ./cache/cord-330067-ujhgb3b0.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler 
X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-330067-ujhgb3b0.txt' === file2bib.sh === id: cord-264746-gfn312aa author: Muse, Spencer title: GENOMICS AND BIOINFORMATICS date: 2012-03-29 pages: extension: .txt txt: ./txt/cord-264746-gfn312aa.txt cache: ./cache/cord-264746-gfn312aa.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-264746-gfn312aa.txt' === file2bib.sh === id: cord-341564-fvuwick5 author: Qi, Zhao-Hui title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application date: 2018-06-12 pages: extension: .txt txt: ./txt/cord-341564-fvuwick5.txt cache: ./cache/cord-341564-fvuwick5.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-341564-fvuwick5.txt' === file2bib.sh === id: cord-334127-wjf8t8vp author: Brister, J. 
Rodney title: NCBI Viral Genomes Resource date: 2015-01-28 pages: extension: .txt txt: ./txt/cord-334127-wjf8t8vp.txt cache: ./cache/cord-334127-wjf8t8vp.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-334127-wjf8t8vp.txt' === file2bib.sh === id: cord-339209-oe8onyr9 author: Vasilakis, Nikos title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date: 2014-05-20 pages: extension: .txt txt: ./txt/cord-339209-oe8onyr9.txt cache: ./cache/cord-339209-oe8onyr9.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-339209-oe8onyr9.txt' === file2bib.sh === id: cord-022348-w7z97wir author: Sola, Monica title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing? 
date: 2007-09-02 pages: extension: .txt txt: ./txt/cord-022348-w7z97wir.txt cache: ./cache/cord-022348-w7z97wir.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-022348-w7z97wir.txt' === file2bib.sh === id: cord-342785-55r01n0x author: Lemmon, Gordon H title: Predicting the sensitivity and specificity of published real-time PCR assays date: 2008-09-25 pages: extension: .txt txt: ./txt/cord-342785-55r01n0x.txt cache: ./cache/cord-342785-55r01n0x.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-342785-55r01n0x.txt' === file2bib.sh === id: cord-016594-lj0us1dq author: Flower, Darren R. title: Identification of Candidate Vaccine Antigens In Silico date: 2012-09-28 pages: extension: .txt txt: ./txt/cord-016594-lj0us1dq.txt cache: ./cache/cord-016594-lj0us1dq.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-016594-lj0us1dq.txt' === file2bib.sh === id: cord-348427-worgd0xu author: Hatcher, Eneida L. 
title: Virus Variation Resource – improved response to emergent viral outbreaks date: 2017-01-04 pages: extension: .txt txt: ./txt/cord-348427-worgd0xu.txt cache: ./cache/cord-348427-worgd0xu.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-348427-worgd0xu.txt' === file2bib.sh === id: cord-017354-cndb031c author: Janies, D. title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases date: 2008 pages: extension: .txt txt: ./txt/cord-017354-cndb031c.txt cache: ./cache/cord-017354-cndb031c.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-017354-cndb031c.txt' === file2bib.sh === id: cord-339915-8j04y50s author: Deng, Wei title: DV-Curve Representation of Protein Sequences and Its Application date: 2014-05-08 pages: extension: .txt txt: ./txt/cord-339915-8j04y50s.txt cache: ./cache/cord-339915-8j04y50s.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-339915-8j04y50s.txt' === file2bib.sh === id: cord-328644-odtue60a author: Comandatore, Francesco title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes date: 2020-05-28 pages: extension: .txt txt: ./txt/cord-328644-odtue60a.txt cache: ./cache/cord-328644-odtue60a.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 
'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-328644-odtue60a.txt' === file2bib.sh === id: cord-331698-rwow1ydx author: Latorre-Pérez, Adriel title: A lab in the field: applications of real-time, in situ metagenomic sequencing date: 2020-08-20 pages: extension: .txt txt: ./txt/cord-331698-rwow1ydx.txt cache: ./cache/cord-331698-rwow1ydx.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-331698-rwow1ydx.txt' === file2bib.sh === id: cord-330312-1pjolkql author: Liu, Y.-T. title: Infectious Disease Genomics date: 2017-01-20 pages: extension: .txt txt: ./txt/cord-330312-1pjolkql.txt cache: ./cache/cord-330312-1pjolkql.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-330312-1pjolkql.txt' === file2bib.sh === id: cord-341879-vubszdp2 author: Li, Lucy M title: Genomic analysis of emerging pathogens: methods, application and future trends date: 2014-11-22 pages: extension: .txt txt: ./txt/cord-341879-vubszdp2.txt cache: ./cache/cord-341879-vubszdp2.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-341879-vubszdp2.txt' === file2bib.sh === /data-disk/reader-compute/reader-cord/bin/file2bib.sh: fork: retry: No child processes id: cord-338207-60vrlrim author: 
Lefkowitz, E.J. title: Virus Databases date: 2008-07-30 pages: extension: .txt txt: ./txt/cord-338207-60vrlrim.txt cache: ./cache/cord-338207-60vrlrim.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-338207-60vrlrim.txt' === file2bib.sh === id: cord-343863-q1y8uscj author: Whitney, Joe title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches date: 2005-02-08 pages: extension: .txt txt: ./txt/cord-343863-q1y8uscj.txt cache: ./cache/cord-343863-q1y8uscj.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-343863-q1y8uscj.txt' === file2bib.sh === id: cord-011565-8ncgldaq author: Elworth, R A Leo title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: 2020-06-04 pages: extension: .txt txt: ./txt/cord-011565-8ncgldaq.txt cache: ./cache/cord-011565-8ncgldaq.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-011565-8ncgldaq.txt' === file2bib.sh === id: cord-328259-3g4klpyg author: Guajardo-Leiva, Sergio title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City date: 2020-09-21 pages: extension: .txt txt: ./txt/cord-328259-3g4klpyg.txt cache: ./cache/cord-328259-3g4klpyg.txt Content-Encoding UTF-8 Content-Type text/plain; 
charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-328259-3g4klpyg.txt' === file2bib.sh === id: cord-340907-j9i1wlak author: Zarai, Yoram title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts date: 2020-04-27 pages: extension: .txt txt: ./txt/cord-340907-j9i1wlak.txt cache: ./cache/cord-340907-j9i1wlak.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-340907-j9i1wlak.txt' === file2bib.sh === id: cord-103297-4stnx8dw author: Widrich, Michael title: Modern Hopfield Networks and Attention for Immune Repertoire Classification date: 2020-08-17 pages: extension: .txt txt: ./txt/cord-103297-4stnx8dw.txt cache: ./cache/cord-103297-4stnx8dw.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-103297-4stnx8dw.txt' === file2bib.sh === id: cord-334394-qgyzk7th author: Edgar, Robert C. 
title: Petabase-scale sequence alignment catalyses viral discovery date: 2020-08-10 pages: extension: .txt txt: ./txt/cord-334394-qgyzk7th.txt cache: ./cache/cord-334394-qgyzk7th.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 5 resourceName b'cord-334394-qgyzk7th.txt' === file2bib.sh === id: cord-018963-2lia97db author: Xu, Ying title: Protein Structure Prediction by Protein Threading date: 2010-04-29 pages: extension: .txt txt: ./txt/cord-018963-2lia97db.txt cache: ./cache/cord-018963-2lia97db.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-018963-2lia97db.txt' === file2bib.sh === id: cord-344782-ond1ziu5 author: Zhang, Jing title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi) date: 2018-10-24 pages: extension: .txt txt: ./txt/cord-344782-ond1ziu5.txt cache: ./cache/cord-344782-ond1ziu5.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-344782-ond1ziu5.txt' === file2bib.sh === id: cord-353290-1wi1dhv6 author: Kustin, Talia title: Biased mutation and selection in RNA viruses date: 2020-09-28 pages: extension: .txt txt: ./txt/cord-353290-1wi1dhv6.txt cache: ./cache/cord-353290-1wi1dhv6.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By 
['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 2 resourceName b'cord-353290-1wi1dhv6.txt' === file2bib.sh === id: cord-355075-ieb35upi author: Papenfuss, Anthony T title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox date: 2012-06-20 pages: extension: .txt txt: ./txt/cord-355075-ieb35upi.txt cache: ./cache/cord-355075-ieb35upi.txt Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-355075-ieb35upi.txt' === file2bib.sh === id: cord-354465-5nqrrnqr author: Haslinger, Christian title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties date: 1999 pages: extension: .txt txt: ./txt/cord-354465-5nqrrnqr.txt cache: ./cache/cord-354465-5nqrrnqr.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-354465-5nqrrnqr.txt' === file2bib.sh === id: cord-304869-l6a68tqn author: Bielińska-Wąż, Dorota title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity date: 2011-08-28 pages: extension: .txt txt: ./txt/cord-304869-l6a68tqn.txt cache: ./cache/cord-304869-l6a68tqn.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-304869-l6a68tqn.txt' === 
file2bib.sh === id: cord-301827-a7hnuxy5 author: Uversky, Vladimir N title: A decade and a half of protein intrinsic disorder: Biology still waits for physics date: 2013-04-29 pages: extension: .txt txt: ./txt/cord-301827-a7hnuxy5.txt cache: ./cache/cord-301827-a7hnuxy5.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 3 resourceName b'cord-301827-a7hnuxy5.txt' === file2bib.sh === id: cord-014462-11ggaqf1 author: nan title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh date: 2011-04-21 pages: extension: .txt txt: ./txt/cord-014462-11ggaqf1.txt cache: ./cache/cord-014462-11ggaqf1.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 5 resourceName b'cord-014462-11ggaqf1.txt' === file2bib.sh === id: cord-023208-w99gc5nx author: nan title: Poster Presentation Abstracts date: 2006-09-01 pages: extension: .txt txt: ./txt/cord-023208-w99gc5nx.txt cache: ./cache/cord-023208-w99gc5nx.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 6 resourceName b'cord-023208-w99gc5nx.txt' === file2bib.sh === id: cord-004879-pgyzluwp author: nan title: Programmed cell death date: 1994 pages: extension: .txt txt: ./txt/cord-004879-pgyzluwp.txt cache: ./cache/cord-004879-pgyzluwp.txt Content-Encoding 
ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 4 resourceName b'cord-004879-pgyzluwp.txt' === file2bib.sh === id: cord-023209-un2ysc2v author: nan title: Poster Presentations date: 2008-10-07 pages: extension: .txt txt: ./txt/cord-023209-un2ysc2v.txt cache: ./cache/cord-023209-un2ysc2v.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 8 resourceName b'cord-023209-un2ysc2v.txt' === file2bib.sh === id: cord-001835-0s7ok4uw author: nan title: Abstracts of the 29th Annual Symposium of The Protein Society date: 2015-10-01 pages: extension: .txt txt: ./txt/cord-001835-0s7ok4uw.txt cache: ./cache/cord-001835-0s7ok4uw.txt Content-Encoding UTF-8 Content-Type text/plain; charset=UTF-8 X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'] X-TIKA:content_handler ToTextContentHandler X-TIKA:embedded_depth 0 X-TIKA:parse_time_millis 10 resourceName b'cord-001835-0s7ok4uw.txt' Que is empty; done keyword-sequence-cord === reduce.pl bib === id = cord-000257-ampip7od author = Bagowski, Christoph P title = The Nature of Protein Domain Evolution: Shaping the Interaction Network date = 2010-08-17 pages = extension = .txt mime = text/plain words = 4678 sentences = 249 flesch = 43 summary = With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. 
In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig. cache = ./cache/cord-000257-ampip7od.txt txt = ./txt/cord-000257-ampip7od.txt === reduce.pl bib === === reduce.pl bib === id = cord-016798-tv2ntug6 author = Gautam, Ablesh title = Bioinformatics Applications in Advancing Animal Virus Research date = 2019-06-06 pages = extension = .txt mime = text/plain words = 6978 sentences = 405 flesch = 44 summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well virus-specific bioinformatics tools that the researches can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise of new CDSs that are different than previously known CDSs. 
There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al. cache = ./cache/cord-016798-tv2ntug6.txt txt = ./txt/cord-016798-tv2ntug6.txt === reduce.pl bib === id = cord-000473-jpow6iw1 author = Astrovskaya, Irina title = Inferring viral quasispecies spectra from 454 pyrosequencing reads date = 2011-07-28 pages = extension = .txt mime = text/plain words = 5363 sentences = 296 flesch = 54 summary = High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population. 
cache = ./cache/cord-000473-jpow6iw1.txt txt = ./txt/cord-000473-jpow6iw1.txt === reduce.pl bib === id = cord-025610-7vouj8pp author = Latif, Seemab title = Backward-Forward Sequence Generative Network for Multiple Lexical Constraints date = 2020-05-06 pages = extension = .txt mime = text/plain words = 3923 sentences = 230 flesch = 50 summary = In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13] . Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method to address the issue of lexical constrained sequence generation. cache = ./cache/cord-025610-7vouj8pp.txt txt = ./txt/cord-025610-7vouj8pp.txt === reduce.pl bib === id = cord-004862-yv76yvy5 author = Demers, G. 
William title = The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date = 1989 pages = extension = .txt mime = text/plain words = 6659 sentences = 347 flesch = 62 summary = title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species. cache = ./cache/cord-004862-yv76yvy5.txt txt = ./txt/cord-004862-yv76yvy5.txt === reduce.pl bib === id = cord-025948-6dsx7pey author = Maitra, Arindam title = Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility date = 2020-06-04 pages = extension = .txt mime = text/plain words = 7218 sentences = 382 flesch = 56 summary = Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. 
We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country to date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRp and nucleocapsid coding genes. cache = ./cache/cord-025948-6dsx7pey.txt txt = ./txt/cord-025948-6dsx7pey.txt === reduce.pl bib === id = cord-014674-ey29970v author = nan title = Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 date = 2003 pages = extension = .txt mime = text/plain words = 2522 sentences = 181 flesch = 62 summary = title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. We find that, aside from problematic details of the experimental design and some erratic presentations of the data, the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of 'criollo' maize. 3. 
We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig. 2 as regions flanking the CMV p-35S sequence. We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites. cache = ./cache/cord-014674-ey29970v.txt txt = ./txt/cord-014674-ey29970v.txt === reduce.pl bib === id = cord-015850-ef6svn8f author = Saitou, Naruya title = Eukaryote Genomes date = 2013-08-22 pages = extension = .txt mime = text/plain words = 7424 sentences = 484 flesch = 53 summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81]. 
cache = ./cache/cord-015850-ef6svn8f.txt txt = ./txt/cord-015850-ef6svn8f.txt === reduce.pl bib === id = cord-018459-isbc1r2o author = Munjal, Geetika title = Phylogenetics Algorithms and Applications date = 2018-12-10 pages = extension = .txt mime = text/plain words = 1851 sentences = 122 flesch = 42 summary = This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . 
Constructing phylogenetic trees using multiple sequence alignment cache = ./cache/cord-018459-isbc1r2o.txt txt = ./txt/cord-018459-isbc1r2o.txt === reduce.pl bib === id = cord-033010-o5kiadfm author = Durojaye, Olanrewaju Ayodeji title = Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date = 2020-10-02 pages = extension = .txt mime = text/plain words = 8125 sentences = 375 flesch = 53 summary = RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively, occupying 45 and 47% of the total residues with the percentage loop occupancy at 8% regarded as comparative modeling, constructs atomic models based on known structures or structures that have been determined experimentally and likewise share more than 40% sequence homology. cache = ./cache/cord-033010-o5kiadfm.txt txt = ./txt/cord-033010-o5kiadfm.txt === reduce.pl bib === id = cord-012975-u87ol3fs author = Ogiwara, Atsushi title = Construction of a dictionary of sequence motifs that characterize groups of related proteins date = 1992-09-17 pages = extension = .txt mime = text/plain words = 3112 sentences = 165 flesch = 55 summary = An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. 
The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989), are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins: Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. cache = ./cache/cord-012975-u87ol3fs.txt txt = ./txt/cord-012975-u87ol3fs.txt === reduce.pl bib === id = cord-103029-nc5yf6x4 author = Wichmann, Stefan title = Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank date = 2020-09-25 pages = extension = .txt mime = text/plain words = 8665 sentences = 387 flesch = 52 summary = In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occurring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. 
These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here. cache = ./cache/cord-103029-nc5yf6x4.txt txt = ./txt/cord-103029-nc5yf6x4.txt === reduce.pl bib === id = cord-256608-ajzk86rq author = van Weezep, Erik title = PCR diagnostics: In silico validation by an automated tool using freely available software programs date = 2019-05-13 pages = extension = .txt mime = text/plain words = 4950 sentences = 258 flesch = 54 summary = An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017; . The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001) . cache = ./cache/cord-256608-ajzk86rq.txt txt = ./txt/cord-256608-ajzk86rq.txt === reduce.pl bib === id = cord-001340-kqcx7lrq author = Ladner, Jason T. title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date = 2014-06-17 pages = extension = .txt mime = text/plain words = 2512 sentences = 121 flesch = 40 summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. 
Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. cache = ./cache/cord-001340-kqcx7lrq.txt txt = ./txt/cord-001340-kqcx7lrq.txt === reduce.pl bib === === reduce.pl bib === id = cord-017584-9rx4jlw8 author = Kim, Kwangsoo title = Selecting Genotyping Oligo Probes Via Logical Analysis of Data date = 2007 pages = extension = .txt mime = text/plain words = 3665 sentences = 216 flesch = 57 summary = Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. 
As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 with using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites. cache = ./cache/cord-017584-9rx4jlw8.txt txt = ./txt/cord-017584-9rx4jlw8.txt === reduce.pl bib === id = cord-002473-2kpxhzbe author = Das, Jayanta Kumar title = Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach date = 2017-03-31 pages = extension = .txt mime = text/plain words = 4613 sentences = 285 flesch = 61 summary = Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups. cache = ./cache/cord-002473-2kpxhzbe.txt txt = ./txt/cord-002473-2kpxhzbe.txt === reduce.pl bib === id = cord-010161-bcuec2fz author = Matson, David O. title = IV, 6. 
Calicivirus RNA recombination date = 2004-09-14 pages = extension = .txt mime = text/plain words = 3335 sentences = 168 flesch = 45 summary = With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2) . While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5' end of that strain's capsid gene (ID="B" sequence for this Fig.) . The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades. cache = ./cache/cord-010161-bcuec2fz.txt txt = ./txt/cord-010161-bcuec2fz.txt === reduce.pl bib === id = cord-011565-8ncgldaq author = Elworth, R A Leo title = To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date = 2020-06-04 pages = extension = .txt mime = text/plain words = 12960 sentences = 717 flesch = 53 summary = For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7) , Count-Min Sketch (CMS) (8) , HyperLogLog (9) and Bloom filters (10) . 
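As a rough illustration of the sketching techniques named in the summary above, the following is a minimal MinHash sketch for estimating the Jaccard similarity of two sequences from their k-mer sets. It is not taken from the reviewed paper: the function names, the choice of `k`, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function.

    Salted BLAKE2b is an illustrative choice, not what any specific tool uses.
    """
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=4, num_hashes=128):
    """Signature = minimum hash value of the k-mer set under each hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "ACGTACGTTAGCCGATAGCTAGGCTA"
    b = "ACGTACGTTAGCCGTTAGCTAGGCTA"
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signature has a fixed size regardless of sequence length, which is why sketches of this kind scale to large collections; the estimate becomes tighter as `num_hashes` grows.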
A more in-depth discussion of many of these topics can also be found in (3, 4), including a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13), with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). One of the recent breakthroughs in the area of large-scale biological sequence comparison is the use of locality-sensitive hashing, specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. cache = ./cache/cord-011565-8ncgldaq.txt txt = ./txt/cord-011565-8ncgldaq.txt === reduce.pl bib === id = cord-005060-n901y2d4 author = ZHANG, Feiyun title = Complete Nucleotide Sequence of Ryegrass Mottle Virus: A New Species of the Genus Sobemovirus date = 2001 pages = extension = .txt mime = text/plain words = 2602 sentences = 173 flesch = 62 summary = The largest ORF, ORF 2, encodes a polyprotein of 947 amino acids (103.6 kDa) that contains a serine protease and an RNA-dependent RNA polymerase. The genome sequences of sobemoviruses have been determined for Southern bean mosaic virus (SBMV), CfMV, Rice yellow mottle virus (RYMV) and Lucerne transient streak virus (LTSV, accession number U31286). However, the conserved WAG + E/D-rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region; proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains, in the same order as in SBMV ORF 2 (ref. 22) (Fig. 3). In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa.
The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8). cache = ./cache/cord-005060-n901y2d4.txt txt = ./txt/cord-005060-n901y2d4.txt === reduce.pl bib === id = cord-001537-i34vmfpp author = Lima, Francisco Esmaile de Sales title = Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil date = 2015-02-17 pages = extension = .txt mime = text/plain words = 3874 sentences = 195 flesch = 53 summary = The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C ). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2) . 
cache = ./cache/cord-001537-i34vmfpp.txt txt = ./txt/cord-001537-i34vmfpp.txt === reduce.pl bib === id = cord-256278-jvfjf7aw author = Feng, Jie title = New method for comparing DNA primary sequences based on a discrimination measure date = 2010-10-21 pages = extension = .txt mime = text/plain words = 2864 sentences = 186 flesch = 53 summary = Three years later, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2, we present the similarity/dissimilarity matrix for the full DNA sequences of the β-globin gene from the 10 species listed in Table 1, computed by our new method. In Fig. 2, we show the phylogenetic tree of the 10 β-globin gene sequences based on the distance matrix DM, using the NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement over existing methods, but an alternative for the similarity analysis of DNA sequences. cache = ./cache/cord-256278-jvfjf7aw.txt txt = ./txt/cord-256278-jvfjf7aw.txt === reduce.pl bib === id = cord-103297-4stnx8dw author = Widrich, Michael title = Modern Hopfield Networks and Attention for Immune Repertoire Classification date = 2020-08-17 pages = extension = .txt mime = text/plain words = 14093 sentences = 926 flesch = 57 summary = In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification.
DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016 Demircigil et al., 2017) have an update rule that is known as the attention mechanisms in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results") Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule cache = ./cache/cord-103297-4stnx8dw.txt txt = ./txt/cord-103297-4stnx8dw.txt === reduce.pl bib === id = cord-000642-mkwpuav6 author = Moreira, Rebeca title = Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing date = 2012-04-19 pages = extension = .txt mime = text/plain words = 6848 sentences = 372 flesch = 45 summary = title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. 
Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18] . philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae). cache = ./cache/cord-000642-mkwpuav6.txt txt = ./txt/cord-000642-mkwpuav6.txt === reduce.pl bib === id = cord-255194-4i9fc0r7 author = Djikeng, Appolinaire title = Viral genome sequencing by random priming methods date = 2008-01-07 pages = extension = .txt mime = text/plain words = 3776 sentences = 207 flesch = 51 summary = An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3' end ( Figure 2 ). Additionally, in order to capture 5' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the Klenow reaction (Figure 2 shows a 5' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10 6 viral particles per RT-PCR reaction ( Figure 5 ). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives. 
cache = ./cache/cord-255194-4i9fc0r7.txt txt = ./txt/cord-255194-4i9fc0r7.txt === reduce.pl bib === id = cord-023647-dlqs8ay9 author = nan title = Sequences and topology date = 2003-03-21 pages = extension = .txt mime = text/plain words = 4505 sentences = 747 flesch = 69 summary = Nucleotide Sequence Analysis of the L Gene of Vesicular Stomatitis Virus (New Jersey Serotype): Identification of Conserved Domains in L Proteins of Nonsegmented Negative-Strand RNA Viruses; Derse, Equine Infectious Anemia Virus tat: Insights into the Structure, Function, and Evolution of Lentivirus trans-Activator Proteins; S71 is a Phylogenetically Distinct Human Endogenous Retroviral Element with Structural and Sequence Homology to Simian Sarcoma Virus (SSV); Distinct Ferredoxins from Rhodobacter capsulatus: Complete Amino Acid Sequences and Molecular Evolution; Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2; Identification of Two Highly Conserved Amino Acid Sequences Among the α-subunits and Molecular ...; The Predicted Amino Acid Sequence of α-Internexin is that of a Novel Neuronal Intermediate Filament Protein; Intraspecific Evolution of a Gene Family Coding for Urinary Proteins; Analysis of cDNA for Human Erythrocyte Ankyrin Indicates a Repeated Structure with Homology to Tissue-Differentiation and Cell-Cycle Control Protein cache = ./cache/cord-023647-dlqs8ay9.txt txt = ./txt/cord-023647-dlqs8ay9.txt === reduce.pl bib === id = cord-016594-lj0us1dq author = Flower, Darren R.
title = Identification of Candidate Vaccine Antigens In Silico date = 2012-09-28 pages = extension = .txt mime = text/plain words = 12570 sentences = 653 flesch = 37 summary = In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequence, progresses through an extensive computational stage, ultimately to deliver a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins, analysis in a series of in-vitro and in-vivo assays, animal models and with the ultimate objective of isolating one or two proteins displaying protective immunity. cache = ./cache/cord-016594-lj0us1dq.txt txt = ./txt/cord-016594-lj0us1dq.txt === reduce.pl bib === id = cord-022348-w7z97wir author = Sola, Monica title = Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing? date = 2007-09-02 pages = extension = .txt mime = text/plain words = 10892 sentences = 671 flesch = 56 summary = An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. 
Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6 .1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6 .1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity. cache = ./cache/cord-022348-w7z97wir.txt txt = ./txt/cord-022348-w7z97wir.txt === reduce.pl bib === id = cord-264296-0x90yubt author = Sawmya, Shashata title = Analyzing hCov genome sequences: Applying Machine Intelligence and beyond date = 2020-06-03 pages = extension = .txt mime = text/plain words = 5008 sentences = 312 flesch = 60 summary = We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the sars-cov-2 genome sequence followed by further analyses of the same on several South-Asian countries. D. 
Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome. cache = ./cache/cord-264296-0x90yubt.txt txt = ./txt/cord-264296-0x90yubt.txt === reduce.pl bib === id = cord-264135-s2u76pvk author = Patel, Amrutlal K. title = Complete genome sequence analysis of chicken astrovirus isolate from India date = 2016-12-23 pages = extension = .txt mime = text/plain words = 3755 sentences = 217 flesch = 49 summary = Phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2) . B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate A total of 9-10 epitopes were predicted using SVMTriP using the capsid protein sequence of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of the CAstV/ INDIA/ANAND/2016 nearest to that of CastV/4175 and CAstV/GA2011 and all four chicken astrovirus formed separate cluster except capsid protein of the CAstV/Poland/G059/ 2014 isolate which was clustered along with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 
4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with cache = ./cache/cord-264135-s2u76pvk.txt txt = ./txt/cord-264135-s2u76pvk.txt === reduce.pl bib === id = cord-203232-1nnqx1g9 author = Canturk, Semih title = Machine-Learning Driven Drug Repurposing for COVID-19 date = 2020-06-25 pages = extension = .txt mime = text/plain words = 5023 sentences = 257 flesch = 52 summary = Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence. 
cache = ./cache/cord-203232-1nnqx1g9.txt txt = ./txt/cord-203232-1nnqx1g9.txt === reduce.pl bib === id = cord-035033-osjy88rc author = Aydin, Berkay title = Spatiotemporal event sequence discovery without thresholds date = 2020-11-09 pages = extension = .txt mime = text/plain words = 8231 sentences = 430 flesch = 54 summary = Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which will take a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials. cache = ./cache/cord-035033-osjy88rc.txt txt = ./txt/cord-035033-osjy88rc.txt === reduce.pl bib === id = cord-266288-buc4dd5y author = Dong, Rui title = A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance date = 2019-04-09 pages = extension = .txt mime = text/plain words = 5247 sentences = 300 flesch = 61 summary = Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). 
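For context on the ANV summary, here is a minimal sketch of the classic 12-dimensional natural vector that the accumulated variant extends: for each nucleotide it records the count, the mean position, and a normalized second central moment of the positions. This is an assumption-laden illustration rather than the authors' code. The normalization by `n_k * n`, the 1-based positions, the zero convention for absent nucleotides, and the function name are choices made for this example, and the six covariance dimensions that ANV adds are deliberately not reproduced here.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Classic natural vector: counts, mean positions, normalized second moments.

    Layout (an assumption of this sketch):
    [n_A, n_C, n_G, n_T, mu_A, mu_C, mu_G, mu_T, D2_A, D2_C, D2_G, D2_T]
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention for absent nucleotides: all three descriptors are zero
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("AACA"))
```

Because every sequence maps to a fixed-length vector, sequences of different lengths can be compared with an ordinary distance (for example Euclidean) without computing an alignment.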
The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016) , however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in R 18 , where the additional six dimensions describe the covariance between nucleotides. cache = ./cache/cord-266288-buc4dd5y.txt txt = ./txt/cord-266288-buc4dd5y.txt === reduce.pl bib === id = cord-266960-kyx6xhvj author = Temple, Mark D. title = Real-time audio and visual display of the Coronavirus genome date = 2020-10-02 pages = extension = .txt mime = text/plain words = 6780 sentences = 360 flesch = 56 summary = The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORF's, these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. 
In the Additional file 4: supplementary example 'Sonification Sub-genomic RNA' the auditory display represents the process of transcription. cache = ./cache/cord-266960-kyx6xhvj.txt txt = ./txt/cord-266960-kyx6xhvj.txt === reduce.pl bib === id = cord-018133-2otxft31 author = Altman, Russ B. title = Bioinformatics date = 2006 pages = extension = .txt mime = text/plain words = 9592 sentences = 462 flesch = 46 summary = Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biological Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantages of either explicit or computed links between these data resources. cache = ./cache/cord-018133-2otxft31.txt txt = ./txt/cord-018133-2otxft31.txt === reduce.pl bib === id = cord-001786-ybd8hi8y author = Dutilh, Bas E title = Metagenomic ventures into outer sequence space date = 2014-12-15 pages = extension = .txt mime = text/plain words = 2193 sentences = 121 flesch = 44 summary = These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. 
Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. cache = ./cache/cord-001786-ybd8hi8y.txt txt = ./txt/cord-001786-ybd8hi8y.txt === reduce.pl bib === id = cord-003316-r5te5xob author = Balloux, Francois title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic date = 2018-12-17 pages = extension = .txt mime = text/plain words = 7340 sentences = 327 flesch = 34 summary = WGS-based strain identification gives a far superior resolution. In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster.
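The assembly-free idea in the last sentence can be sketched as follows: count the k-mers in the raw reads and score each candidate reference by the fraction of its k-mers that the reads contain. This is a toy illustration under stated assumptions, not the workflow of any particular clinical tool. The reference names, the sequences, the value of `k`, and the containment score are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every unambiguous (A/C/G/T only) k-mer across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def containment(reference, read_kmers, k):
    """Fraction of the reference's distinct k-mers that occur in the reads."""
    reference = reference.upper()
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    if not ref_kmers:
        return 0.0
    found = sum(1 for kmer in ref_kmers if kmer in read_kmers)
    return found / len(ref_kmers)


def best_reference(reads, references, k=5):
    """Return the best-scoring reference name and all containment scores."""
    read_kmers = count_kmers(reads, k)
    scores = {name: containment(seq, read_kmers, k) for name, seq in references.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical references and reads, for illustration only
    references = {
        "strain_A": "ACGTACGTTAGCCGATAGC",
        "strain_B": "TTTTGGGGCCCCAAAATTTG",
    }
    reads = ["ACGTACGTTAGC", "CGTTAGCCGATAGC"]
    print(best_reference(reads, references))
```

Real data would call for a much larger `k`, handling of reverse complements, and a count threshold to discard k-mers created by sequencing errors; those refinements are omitted to keep the sketch short.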
These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. cache = ./cache/cord-003316-r5te5xob.txt txt = ./txt/cord-003316-r5te5xob.txt === reduce.pl bib === id = cord-300796-rmjv56ia author = nan title = The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation date = 1990-09-01 pages = extension = .txt mime = text/plain words = 8031 sentences = 405 flesch = 57 summary = In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40 residues long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980 ; references on dhfr sequence in legend to Fig. 1) , Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain. 
cache = ./cache/cord-300796-rmjv56ia.txt txt = ./txt/cord-300796-rmjv56ia.txt === reduce.pl bib === id = cord-017932-vmtjc8ct author = Georgiev, Vassil St. title = Genomic and Postgenomic Research date = 2009 pages = extension = .txt mime = text/plain words = 8476 sentences = 360 flesch = 36 summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. 
cache = ./cache/cord-017932-vmtjc8ct.txt txt = ./txt/cord-017932-vmtjc8ct.txt === reduce.pl bib === id = cord-265857-fs6dj3dp author = Liu, Yu-Tsueng title = Infectious Disease Genomics date = 2010-12-24 pages = extension = .txt mime = text/plain words = 4341 sentences = 233 flesch = 45 summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002) . Genome sequencing projects for other important human disease vectors are in progress Megy et al., 2009 ). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10 .1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. cache = ./cache/cord-265857-fs6dj3dp.txt txt = ./txt/cord-265857-fs6dj3dp.txt === reduce.pl bib === id = cord-010273-0c56x9f5 author = Simmonds, Peter title = Virology of hepatitis C virus date = 2001-10-10 pages = extension = .txt mime = text/plain words = 7897 sentences = 337 flesch = 41 summary = 1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complimentary to the HCV genome. 
Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are those of the flaviviruses. In contrast to the highly restricted sequence diversity of the 5'NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III; refs. 111-114) and show a three-to-four-times higher rate of sequence change with time in persistently infected patients (ref. 115). Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection. cache = ./cache/cord-010273-0c56x9f5.txt txt = ./txt/cord-010273-0c56x9f5.txt === reduce.pl bib === id = cord-010499-yefxrj30 author = Yelverton, Elizabeth title = The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli date = 2006-10-27 pages = extension = .txt mime = text/plain words = 5883 sentences = 330 flesch = 60 summary = Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gallant, 1983; 1986; Gallant et al., 1985; Kurland and Gallant, 1986). Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsley, 1992; Peter et al.
coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site. cache = ./cache/cord-010499-yefxrj30.txt txt = ./txt/cord-010499-yefxrj30.txt === reduce.pl bib === id = cord-263987-ff6kor0c author = Holmes, Ian H. title = Solving the master equation for Indels date = 2017-05-12 pages = extension = .txt mime = text/plain words = 7131 sentences = 357 flesch = 44 summary = BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances. cache = ./cache/cord-263987-ff6kor0c.txt txt = ./txt/cord-263987-ff6kor0c.txt === reduce.pl bib === id = cord-022494-d66rz6dc author = Webb, B. 
title = Comparative Modeling of Drug Target Proteins date = 2014-10-01 pages = extension = .txt mime = text/plain words = 8782 sentences = 453 flesch = 47 summary = Comparative modeling consists of four main steps 23 (Figure 2(a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (Figure 2(b)). cache = ./cache/cord-022494-d66rz6dc.txt txt = ./txt/cord-022494-d66rz6dc.txt === reduce.pl bib === id = cord-253436-dz84icdc author = Wille, Michelle title = High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl date = 2016-03-03 pages = extension = .txt mime = text/plain words = 2019 sentences = 103 flesch = 54 summary = In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. 
Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the same species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small sample sizes and differences in prevalence, what is clear is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks, are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30], and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31]. cache = ./cache/cord-253436-dz84icdc.txt txt = ./txt/cord-253436-dz84icdc.txt === reduce.pl bib === id = cord-193910-7p3f3znj author = Zhang, Xiangxie title = Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification date = 2020-11-01 pages = extension = .txt mime = text/plain words = 7724 sentences = 436 flesch = 59 summary = In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a convolutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table (8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. 
It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken. cache = ./cache/cord-193910-7p3f3znj.txt txt = ./txt/cord-193910-7p3f3znj.txt === reduce.pl bib === id = cord-017354-cndb031c author = Janies, D. title = Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases date = 2008 pages = extension = .txt mime = text/plain words = 12429 sentences = 648 flesch = 45 summary = The products of a phylogenetic analysis are a graphical tree of ancestor-descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor or descendent. Phylogenetic analysis of large genomic datasets can present several nested NP-complete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92]. Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts. 
cache = ./cache/cord-017354-cndb031c.txt txt = ./txt/cord-017354-cndb031c.txt === reduce.pl bib === id = cord-255371-o9oxchq6 author = Nguyen, Thanh Thi title = Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) date = 2020-07-10 pages = extension = .txt mime = text/plain words = 5640 sentences = 365 flesch = 59 summary = title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13] . By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics. cache = ./cache/cord-255371-o9oxchq6.txt txt = ./txt/cord-255371-o9oxchq6.txt === reduce.pl bib === id = cord-014462-11ggaqf1 author = nan title = Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh date = 2011-04-21 pages = extension = .txt mime = text/plain words = 35453 sentences = 1711 flesch = 49 summary = Molecular diagnosis based on reverse transcription (RT)-PCR s.a. 
one step or nested PCR, nucleic acid sequence based amplification (NASBA), or real time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that an NS1 antigen based ELISA test can be a useful tool to detect dengue virus infection in patients during the early acute phase of disease, since appearance of IgM antibodies usually occurs after the fifth day of the infection. The studies showed high level of expression in case of constructed vector as compared to infected virus for the specific protein. cache = ./cache/cord-014462-11ggaqf1.txt txt = ./txt/cord-014462-11ggaqf1.txt === reduce.pl bib === id = cord-014461-2ubh9u8r author = Nelson, Oranmiyan W. title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date = 2012-10-10 pages = extension = .txt mime = text/plain words = 4124 sentences = 454 flesch = 44 summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. 
Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042 cache = ./cache/cord-014461-2ubh9u8r.txt txt = ./txt/cord-014461-2ubh9u8r.txt === reduce.pl bib === id = cord-268549-2lg8i9r1 author = Dai, Qi title = Sequence comparison via polar coordinates representation and curve tree date = 2012-01-07 pages = extension = .txt mime = text/plain words = 4360 sentences = 272 flesch = 59 summary = It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. 
Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation cache = ./cache/cord-268549-2lg8i9r1.txt txt = ./txt/cord-268549-2lg8i9r1.txt === reduce.pl bib === id = cord-001974-wjf3c7a7 author = Friis-Nielsen, Jens title = Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date = 2016-02-19 pages = extension = .txt mime = text/plain words = 5773 sentences = 348 flesch = 48 summary = Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP. cache = ./cache/cord-001974-wjf3c7a7.txt txt = ./txt/cord-001974-wjf3c7a7.txt === reduce.pl bib === id = cord-275258-azpg5yrh author = Mead, Dylan J.T. 
title = Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date = 2019-07-26 pages = extension = .txt mime = text/plain words = 6333 sentences = 346 flesch = 53 summary = title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level. cache = ./cache/cord-275258-azpg5yrh.txt txt = ./txt/cord-275258-azpg5yrh.txt === reduce.pl bib === id = cord-023208-w99gc5nx author = nan title = Poster Presentation Abstracts date = 2006-09-01 pages = extension = .txt mime = text/plain words = 70854 sentences = 3492 flesch = 43 summary = In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. 
Ubiquitination is a well-known mechanism in protein degradation of eukaryotic cells, in which many obsolete proteins and proteins with corrupted three-dimensional structure become marked by covalent attachment of ubiquitin through a multi-step enzymatic pathway. Ubiquitin is a small, 8.5 kDa peptide of 76 amino acid residues that targets such substrates for proteolysis in the proteasome. Recent studies showed that an extracellular ubiquitination process also takes place in the epididymis of humans and other animals and marks proteins on the surface of defective sperm. It appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. This head-to-tail cyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity. As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity. cache = ./cache/cord-023208-w99gc5nx.txt txt = ./txt/cord-023208-w99gc5nx.txt === reduce.pl bib === id = cord-321386-u1imic5l author = Li, Chun title = Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation date = 2018-02-17 pages = extension = .txt mime = text/plain words = 5503 sentences = 311 flesch = 59 summary = METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop a SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. 
By combining these elements with the conventional amino acid composition (AAC), a feature vector can be constructed to numerically characterize a protein sequence. By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition cache = ./cache/cord-321386-u1imic5l.txt txt = ./txt/cord-321386-u1imic5l.txt === reduce.pl bib === id = cord-306725-0vam15pt author = Li, Hao title = First detection and genomic characteristics of bovine torovirus in dairy calves in China date = 2020-05-09 pages = extension = .txt mime = text/plain words = 3015 sentences = 156 flesch = 58 summary = Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2), designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV/SC-2/China investigated in this study are indicated by black triangles. Fig. 2 Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins. 
cache = ./cache/cord-306725-0vam15pt.txt txt = ./txt/cord-306725-0vam15pt.txt === reduce.pl bib === id = cord-027316-echxuw74 author = Modarresi, Kourosh title = Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date = 2020-05-22 pages = extension = .txt mime = text/plain words = 2116 sentences = 148 flesch = 49 summary = This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94] . Though, modified version of RNN like LSTM and GRU have been improvement over RNN (recurrent neural networks) in dealing with vanishing gradients and long-term memory loss, still they suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is, α Embedding Error 2 (6) Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation cache = ./cache/cord-027316-echxuw74.txt txt = ./txt/cord-027316-echxuw74.txt === reduce.pl bib === id = cord-213136-euv6pqh5 author = Singh, Kulveer title = Sequence Effects on Internal Structure of Droplets of Associative Polymers date = 2020-05-17 pages = extension = .txt mime = text/plain words = 4329 sentences = 184 flesch = 56 summary = We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. 
Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers). cache = ./cache/cord-213136-euv6pqh5.txt txt = ./txt/cord-213136-euv6pqh5.txt === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === id = cord-252347-vnn4135b author = Lee, Wai-Ming title = A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants date = 2007-10-03 pages = extension = .txt mime = text/plain words = 5672 sentences = 271 flesch = 51 summary = METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amount of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. 
This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5'NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences. cache = ./cache/cord-252347-vnn4135b.txt txt = ./txt/cord-252347-vnn4135b.txt === reduce.pl bib === id = cord-264746-gfn312aa author = Muse, Spencer title = GENOMICS AND BIOINFORMATICS date = 2012-03-29 pages = extension = .txt mime = text/plain words = 10976 sentences = 583 flesch = 58 summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. 
Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. cache = ./cache/cord-264746-gfn312aa.txt txt = ./txt/cord-264746-gfn312aa.txt === reduce.pl bib === id = cord-267500-x3u9i1vq author = Rose, Rebecca title = Challenges in the analysis of viral metagenomes date = 2016-08-03 pages = extension = .txt mime = text/plain words = 5928 sentences = 308 flesch = 40 summary = Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. 
We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al. cache = ./cache/cord-267500-x3u9i1vq.txt txt = ./txt/cord-267500-x3u9i1vq.txt === reduce.pl bib === id = cord-311240-o0zyt2vb author = Motayo, Babatunde Olarenwaju title = Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences date = 2020-07-27 pages = extension = .txt mime = text/plain words = 3091 sentences = 167 flesch = 50 summary = Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating. We advocate upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control efforts against SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al., 2020). This study was designed to determine the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against reference whole genome sequences of SARS: recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2). 
cache = ./cache/cord-311240-o0zyt2vb.txt txt = ./txt/cord-311240-o0zyt2vb.txt === reduce.pl bib === id = cord-321715-bkfkmtld author = Redelings, Benjamin D title = Incorporating indel information into phylogeny estimation for rapidly emerging pathogens date = 2007-03-14 pages = extension = .txt mime = text/plain words = 9793 sentences = 546 flesch = 54 summary = To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t 1 , ..., t 2N -3 ) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation. cache = ./cache/cord-321715-bkfkmtld.txt txt = ./txt/cord-321715-bkfkmtld.txt === reduce.pl bib === id = cord-311839-61djk4bs author = Wei, Dan title = A novel hierarchical clustering algorithm for gene sequences date = 2012-07-23 pages = extension = .txt mime = text/plain words = 8033 sentences = 496 flesch = 61 summary = We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. 
In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1. cache = ./cache/cord-311839-61djk4bs.txt txt = ./txt/cord-311839-61djk4bs.txt === reduce.pl bib === id = cord-018963-2lia97db author = Xu, Ying title = Protein Structure Prediction by Protein Threading date = 2010-04-29 pages = extension = .txt mime = text/plain words = 15309 sentences = 716 flesch = 48 summary = Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now. cache = ./cache/cord-018963-2lia97db.txt txt = ./txt/cord-018963-2lia97db.txt === reduce.pl bib === === reduce.pl bib === id = cord-102766-n6mpdhyu author = Alam, Md. 
Nafis Ul title = Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date = 2020-06-25 pages = extension = .txt mime = text/plain words = 3193 sentences = 192 flesch = 56 summary = title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. cache = ./cache/cord-102766-n6mpdhyu.txt txt = ./txt/cord-102766-n6mpdhyu.txt === reduce.pl bib === === reduce.pl bib === id = cord-321150-ev6acl7b author = Lam, Ha Minh title = Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm date = 2017-10-03 pages = extension = .txt mime = text/plain words = 3184 sentences = 161 flesch = 50 summary = Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring.
To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values. cache = ./cache/cord-321150-ev6acl7b.txt txt = ./txt/cord-321150-ev6acl7b.txt === reduce.pl bib === id = cord-302798-q0mbngqy author = Ge, Junwei title = Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China date = 2018-06-14 pages = extension = .txt mime = text/plain words = 4343 sentences = 273 flesch = 58 summary = In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27] . 
cache = ./cache/cord-302798-q0mbngqy.txt txt = ./txt/cord-302798-q0mbngqy.txt === reduce.pl bib === id = cord-266794-oyppubq5 author = Zhang, Dachuan title = SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model date = 2020-09-01 pages = extension = .txt mime = text/plain words = 1003 sentences = 75 flesch = 48 summary = title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020) , a genome browser (Ham, et al., 2012) , and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species. 
cache = ./cache/cord-266794-oyppubq5.txt txt = ./txt/cord-266794-oyppubq5.txt === reduce.pl bib === === reduce.pl bib === id = cord-280881-5o38ihe0 author = Wlodawer, Alexander title = A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date = 2003-11-11 pages = extension = .txt mime = text/plain words = 4862 sentences = 220 flesch = 51 summary = These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8] . We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences ( Figure 1 ). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin ( Figure 4 ), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2. cache = ./cache/cord-280881-5o38ihe0.txt txt = ./txt/cord-280881-5o38ihe0.txt === reduce.pl bib === id = cord-274056-9t3kneoo author = Abd Elwahaab, Marwa A. title = A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector date = 2019-05-08 pages = extension = .txt mime = text/plain words = 3314 sentences = 251 flesch = 59 summary = title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1 . 
The similarity/dissimilarity vectors that are corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2 . The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4 . The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike ) is illustrated in Table 14 and shown in Figure 6 . cache = ./cache/cord-274056-9t3kneoo.txt txt = ./txt/cord-274056-9t3kneoo.txt === reduce.pl bib === id = cord-325985-xfzhn1n1 author = Jabado, Omar J. title = Comprehensive viral oligonucleotide probe design using conserved protein regions date = 2007-12-13 pages = extension = .txt mime = text/plain words = 4260 sentences = 227 flesch = 47 summary = The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions. 
cache = ./cache/cord-325985-xfzhn1n1.txt txt = ./txt/cord-325985-xfzhn1n1.txt === reduce.pl bib === id = cord-268467-btfz6ye8 author = Schreiber, Steven S. title = Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E date = 1989-03-31 pages = extension = .txt mime = text/plain words = 5035 sentences = 343 flesch = 59 summary = The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3'-end of the genomic RNA or the leader sequence. The 3'-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3'-end (Fig. 4) which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et al., 1983; Lapps et al., 1987; Kamahora et al., 1988; Boursnell et al., 1985) (Table 1). Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3' end of the viral mRNA leader sequence cache = ./cache/cord-268467-btfz6ye8.txt txt = ./txt/cord-268467-btfz6ye8.txt === reduce.pl bib === id = cord-301827-a7hnuxy5 author = Uversky, Vladimir N title = A decade and a half of protein intrinsic disorder: Biology still waits for physics date = 2013-04-29 pages = extension = .txt mime = text/plain words = 20971 sentences = 1059 flesch = 43 summary = 94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs.
Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns. cache = ./cache/cord-301827-a7hnuxy5.txt txt = ./txt/cord-301827-a7hnuxy5.txt === reduce.pl bib === id = cord-300149-djclli8n author = Ruan, Yijun title = Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection date = 2003-05-24 pages = extension = .txt mime = text/plain words = 4355 sentences = 226 flesch = 54 summary = title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). 
In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677. cache = ./cache/cord-300149-djclli8n.txt txt = ./txt/cord-300149-djclli8n.txt === reduce.pl bib === id = cord-279528-41atidai author = Abo-Elkhier, Mervat M. title = Measuring Similarity among Protein Sequences Using a New Descriptor date = 2019-11-22 pages = extension = .txt mime = text/plain words = 3045 sentences = 217 flesch = 57 summary = Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A t , SA t ). e 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in e 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b) respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A t and standard deviation SA t of the combined intensity level value A t (i) of the protein sequence. 
F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids cache = ./cache/cord-279528-41atidai.txt txt = ./txt/cord-279528-41atidai.txt === reduce.pl bib === id = cord-287658-c2lljdi7 author = Lopez-Rincon, Alejandro title = Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning date = 2020-09-10 pages = extension = .txt mime = text/plain words = 4766 sentences = 253 flesch = 55 summary = The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g., Real-Time PCR amplicons to identify the SARS-CoV-2 [16]. The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in [26] the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID (Table 2) repository. cache = ./cache/cord-287658-c2lljdi7.txt txt = ./txt/cord-287658-c2lljdi7.txt === reduce.pl bib === id = cord-287634-64zqe4cz author = Al-Ssulami, Abdulrakeeb M.
title = CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents date = 2020-01-31 pages = extension = .txt mime = text/plain words = 2307 sentences = 137 flesch = 59 summary = For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents cache = ./cache/cord-287634-64zqe4cz.txt txt = ./txt/cord-287634-64zqe4cz.txt === reduce.pl bib === id = cord-304869-l6a68tqn author = Bielińska-Wąż, Dorota title = Graphical and numerical representations of DNA sequences: statistical aspects of similarity date = 2011-08-28 pages = extension = .txt mime = text/plain words = 15408 sentences = 940 flesch = 60 summary = As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. 
The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences so it can be used for single nucleotide polymorphism (SNP) analyses, which are the subject of many investigations, as for example, in a recent work by Bhasi et al. cache = ./cache/cord-304869-l6a68tqn.txt txt = ./txt/cord-304869-l6a68tqn.txt === reduce.pl bib === id = cord-324216-ce3wa889 author = Wang, Zheng title = Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date = 2008-12-01 pages = extension = .txt mime = text/plain words = 5206 sentences = 240 flesch = 49 summary = Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17] . This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17] , developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV.
A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets in conjunction with previously developed sequence analysis algorithm CIBSI can be easily interpreted to make serotype or strain identifications. cache = ./cache/cord-324216-ce3wa889.txt txt = ./txt/cord-324216-ce3wa889.txt === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === id = cord-023209-un2ysc2v author = nan title = Poster Presentations date = 2008-10-07 pages = extension = .txt mime = text/plain words = 111878 sentences = 5398 flesch = 45 summary = Site-specific PEGylation of human IgG1-Fab using a rationally designed trypsin variant In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifications of proteins with artificial functionalities under native state conditions. Recently, our group reported a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes. In the present study, peptide epitopes of VP1 protein both 135-161 (P1) amino acid residues (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and tryptophan (Trp) containing on the N terminus 135-161 amino acid residues (Trp-135-161) (P2) were synthesized by using the microwave assisted solid-phase methods.
Using as a template a peptide, already identified, with agonist activity against PTPRJ (H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifications (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfide bridge) and the preliminary biological results of this set of compounds. cache = ./cache/cord-023209-un2ysc2v.txt txt = ./txt/cord-023209-un2ysc2v.txt === reduce.pl bib === id = cord-004879-pgyzluwp author = nan title = Programmed cell death date = 1994 pages = extension = .txt mime = text/plain words = 81677 sentences = 4465 flesch = 51 summary = Furthermore kinetic experiments after complementation of HIV-RT p66 with HIV-RT p51 indicated that HIV-RT p51 can restore rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer p66/p51, suggesting a function of the 51 kDa polypeptide. The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER). We could show that short-term activation (30 mins.) of c-FosER by estradiol (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins.
cache = ./cache/cord-004879-pgyzluwp.txt txt = ./txt/cord-004879-pgyzluwp.txt === reduce.pl bib === === reduce.pl bib === === reduce.pl bib === id = cord-001835-0s7ok4uw author = nan title = Abstracts of the 29th Annual Symposium of The Protein Society date = 2015-10-01 pages = extension = .txt mime = text/plain words = 138514 sentences = 6150 flesch = 40 summary = Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural aminoacids via the MIO-enzyme toolkit Alina Filip 1 , Judith H Bartha-Vári 1 , Gergely Bánóczy 2 , László Poppe 2 , Csaba Paizs 1 , Florin-Dan Irimie 1 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology An attractive enzymatic route to the highly valuable, enantiomerically pure α- or β-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues. cache = ./cache/cord-001835-0s7ok4uw.txt txt = ./txt/cord-001835-0s7ok4uw.txt === reduce.pl bib === id = cord-326225-crtpzad7 author = Neill, John D.
title = Simultaneous rapid sequencing of multiple RNA virus genomes date = 2014-06-01 pages = extension = .txt mime = text/plain words = 3804 sentences = 204 flesch = 55 summary = This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2 , library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled. cache = ./cache/cord-326225-crtpzad7.txt txt = ./txt/cord-326225-crtpzad7.txt === reduce.pl bib === id = cord-328644-odtue60a author = Comandatore, Francesco title = Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes date = 2020-05-28 pages = extension = .txt mime = text/plain words = 6535 sentences = 301 flesch = 50 summary = These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses -as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016) , which has a single, positive-strand RNA genome. 
To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the GISAID database, we identified 7 non-synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains with respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April 10, 2020 (Figure 1). cache = ./cache/cord-328644-odtue60a.txt txt = ./txt/cord-328644-odtue60a.txt === reduce.pl bib === id = cord-334394-qgyzk7th author = Edgar, Robert C. title = Petabase-scale sequence alignment catalyses viral discovery date = 2020-08-10 pages = extension = .txt mime = text/plain words = 8134 sentences = 423 flesch = 51 summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] .
cache = ./cache/cord-334394-qgyzk7th.txt txt = ./txt/cord-334394-qgyzk7th.txt === reduce.pl bib === id = cord-331698-rwow1ydx author = Latorre-Pérez, Adriel title = A lab in the field: applications of real-time, in situ metagenomic sequencing date = 2020-08-20 pages = extension = .txt mime = text/plain words = 6732 sentences = 335 flesch = 36 summary = This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the next future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25] , microbial communities inspection in the Arctic [26] , DNA sequencing on the International Space Station (ISS) [27] , and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29] . In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced. 
cache = ./cache/cord-331698-rwow1ydx.txt txt = ./txt/cord-331698-rwow1ydx.txt === reduce.pl bib === id = cord-339209-oe8onyr9 author = Vasilakis, Nikos title = Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date = 2014-05-20 pages = extension = .txt mime = text/plain words = 5817 sentences = 272 flesch = 46 summary = The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5'-untranslated region (5'-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail (Figure 2). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CL pro , RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-terminal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain (Figure 4A, 4D). cache = ./cache/cord-339209-oe8onyr9.txt txt = ./txt/cord-339209-oe8onyr9.txt === reduce.pl bib === id = cord-334127-wjf8t8vp author = Brister, J. Rodney title = NCBI Viral Genomes Resource date = 2015-01-28 pages = extension = .txt mime = text/plain words = 3863 sentences = 186 flesch = 37 summary = This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.
Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them transfer of annotation to related genomes (11–13), sequence assembly and virus discovery (14–17), viral dynamics and evolution (18–20) and pathogen detection (14, 21–23). The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated 'viral host' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51). cache = ./cache/cord-334127-wjf8t8vp.txt txt = ./txt/cord-334127-wjf8t8vp.txt === reduce.pl bib === id = cord-348427-worgd0xu author = Hatcher, Eneida L. title = Virus Variation Resource – improved response to emergent viral outbreaks date = 2017-01-04 pages = extension = .txt mime = text/plain words = 5552 sentences = 258 flesch = 48 summary = The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus (Table 2). New processes have been added to parse source descriptor terms from GenBank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. 
To resolve this issue, the Virus Variation database loading pipeline parses GenBank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach. cache = ./cache/cord-348427-worgd0xu.txt txt = ./txt/cord-348427-worgd0xu.txt === reduce.pl bib === id = cord-340907-j9i1wlak author = Zarai, Yoram title = Evolutionary selection against short nucleotide sequences in viruses and their related hosts date = 2020-04-27 pages = extension = .txt mime = text/plain words = 8162 sentences = 415 flesch = 45 summary = Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depict the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses. 
cache = ./cache/cord-340907-j9i1wlak.txt txt = ./txt/cord-340907-j9i1wlak.txt === reduce.pl bib === id = cord-341564-fvuwick5 author = Qi, Zhao-Hui title = Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application date = 2018-06-12 pages = extension = .txt mime = text/plain words = 2647 sentences = 178 flesch = 54 summary = From these, we can see that physicochemical properties are widely applied with graphical representation of protein sequences by these researchers and their results seem good. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids cache = ./cache/cord-341564-fvuwick5.txt txt = ./txt/cord-341564-fvuwick5.txt === reduce.pl bib === id = cord-330067-ujhgb3b0 author = Huang, Yi title = CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes date = 2007-10-02 pages = extension = .txt mime = text/plain words = 3007 sentences = 168 flesch = 55 summary = To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. 
CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as 'Corona_NS3b' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis. cache = ./cache/cord-330067-ujhgb3b0.txt txt = ./txt/cord-330067-ujhgb3b0.txt === reduce.pl bib === id = cord-345552-h6fwi0qn author = Li, Q.-G. title = Hydropathic characteristics of adenovirus hexons date = 1997-07-01 pages = extension = .txt mime = text/plain words = 3522 sentences = 206 flesch = 53 summary = The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2) . 
DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein cache = ./cache/cord-345552-h6fwi0qn.txt txt = ./txt/cord-345552-h6fwi0qn.txt === reduce.pl bib === id = cord-328259-3g4klpyg author = Guajardo-Leiva, Sergio title = Metagenomic Insights into the Sewage RNA Virosphere of a Large City date = 2020-09-21 pages = extension = .txt mime = text/plain words = 7626 sentences = 370 flesch = 47 summary = Despite the overrepresentation of dsRNA viruses, our results show that Santiago's sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B) and the Totiviridae family were also highly abundant in treated and untreated sewage samples from the EU [5, 7]. Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments. cache = ./cache/cord-328259-3g4klpyg.txt txt = ./txt/cord-328259-3g4klpyg.txt === reduce.pl bib === id = cord-330312-1pjolkql author = Liu, Y.-T. 
title = Infectious Disease Genomics date = 2017-01-20 pages = extension = .txt mime = text/plain words = 5168 sentences = 327 flesch = 45 summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of the human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30–32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum cache = ./cache/cord-330312-1pjolkql.txt txt = ./txt/cord-330312-1pjolkql.txt === reduce.pl bib === id = cord-338207-60vrlrim author = Lefkowitz, E.J. title = Virus Databases date = 2008-07-30 pages = extension = .txt mime = text/plain words = 7957 sentences = 368 flesch = 48 summary = (Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. 
While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes. cache = ./cache/cord-338207-60vrlrim.txt txt = ./txt/cord-338207-60vrlrim.txt === reduce.pl bib === id = cord-354465-5nqrrnqr author = Haslinger, Christian title = RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties date = 1999 pages = extension = .txt mime = text/plain words = 10341 sentences = 756 flesch = 67 summary = Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992). 
cache = ./cache/cord-354465-5nqrrnqr.txt txt = ./txt/cord-354465-5nqrrnqr.txt === reduce.pl bib === id = cord-342785-55r01n0x author = Lemmon, Gordon H title = Predicting the sensitivity and specificity of published real-time PCR assays date = 2008-09-25 pages = extension = .txt mime = text/plain words = 4317 sentences = 239 flesch = 52 summary = METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens cache = ./cache/cord-342785-55r01n0x.txt txt = ./txt/cord-342785-55r01n0x.txt === reduce.pl bib === id = cord-344782-ond1ziu5 author = Zhang, Jing title = Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi) date = 2018-10-24 pages = extension = .txt mime = text/plain words = 6003 sentences = 280 flesch = 49 summary = Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. 
Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists, and samples were collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue, two of the main affected organs. cache = ./cache/cord-344782-ond1ziu5.txt txt = ./txt/cord-344782-ond1ziu5.txt === reduce.pl bib === id = cord-339915-8j04y50s author = Deng, Wei title = DV-Curve Representation of Protein Sequences and Its Application date = 2014-05-08 pages = extension = .txt mime = text/plain words = 2946 sentences = 176 flesch = 49 summary = Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is the phylogenetic analysis among coronaviruses based on their spike proteins. In this paper, we introduce DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. 
Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation Similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation cache = ./cache/cord-339915-8j04y50s.txt txt = ./txt/cord-339915-8j04y50s.txt === reduce.pl bib === id = cord-355075-ieb35upi author = Papenfuss, Anthony T title = The immune gene repertoire of an important viral reservoir, the Australian black flying fox date = 2012-06-20 pages = extension = .txt mime = text/plain words = 8952 sentences = 480 flesch = 54 summary = alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response. 
cache = ./cache/cord-355075-ieb35upi.txt txt = ./txt/cord-355075-ieb35upi.txt === reduce.pl bib === id = cord-353290-1wi1dhv6 author = Kustin, Talia title = Biased mutation and selection in RNA viruses date = 2020-09-28 pages = extension = .txt mime = text/plain words = 7611 sentences = 402 flesch = 52 summary = We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses. cache = ./cache/cord-353290-1wi1dhv6.txt txt = ./txt/cord-353290-1wi1dhv6.txt === reduce.pl bib === id = cord-343863-q1y8uscj author = Whitney, Joe title = Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches date = 2005-02-08 pages = extension = .txt mime = text/plain words = 3463 sentences = 179 flesch = 61 summary = ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. 
The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components ( Figure 1 ): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; (4) and a back-end Java program which runs alignment programs and compiles results in the database. cache = ./cache/cord-343863-q1y8uscj.txt txt = ./txt/cord-343863-q1y8uscj.txt === reduce.pl bib === id = cord-341879-vubszdp2 author = Li, Lucy M title = Genomic analysis of emerging pathogens: methods, application and future trends date = 2014-11-22 pages = extension = .txt mime = text/plain words = 5029 sentences = 253 flesch = 36 summary = In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. 
Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences. cache = ./cache/cord-341879-vubszdp2.txt txt = ./txt/cord-341879-vubszdp2.txt ===== Reducing email addresses cord-035033-osjy88rc cord-265857-fs6dj3dp cord-263987-ff6kor0c cord-321386-u1imic5l cord-267500-x3u9i1vq cord-321150-ev6acl7b cord-001835-0s7ok4uw cord-348427-worgd0xu Creating transaction Updating adr table ===== Reducing keywords cord-000257-ampip7od cord-016293-pyb00pt5 cord-016798-tv2ntug6 cord-025610-7vouj8pp cord-000473-jpow6iw1 cord-025948-6dsx7pey cord-014674-ey29970v cord-004862-yv76yvy5 cord-018459-isbc1r2o cord-015850-ef6svn8f cord-012975-u87ol3fs cord-033010-o5kiadfm cord-256608-ajzk86rq cord-103029-nc5yf6x4 cord-010260-8lnpujip cord-001340-kqcx7lrq cord-010161-bcuec2fz cord-002473-2kpxhzbe cord-005060-n901y2d4 cord-017584-9rx4jlw8 cord-011565-8ncgldaq cord-001537-i34vmfpp cord-256278-jvfjf7aw cord-103297-4stnx8dw cord-000642-mkwpuav6 cord-255194-4i9fc0r7 cord-016594-lj0us1dq cord-023647-dlqs8ay9 cord-264296-0x90yubt cord-022348-w7z97wir cord-264135-s2u76pvk cord-266288-buc4dd5y cord-203232-1nnqx1g9 cord-035033-osjy88rc cord-266960-kyx6xhvj cord-018133-2otxft31 cord-003316-r5te5xob cord-001786-ybd8hi8y cord-300796-rmjv56ia cord-017932-vmtjc8ct cord-265857-fs6dj3dp cord-010499-yefxrj30 cord-263987-ff6kor0c cord-010273-0c56x9f5 cord-022494-d66rz6dc cord-193910-7p3f3znj cord-017354-cndb031c cord-253436-dz84icdc cord-255371-o9oxchq6 cord-014462-11ggaqf1 cord-268549-2lg8i9r1 cord-014461-2ubh9u8r cord-001974-wjf3c7a7 cord-306725-0vam15pt cord-321386-u1imic5l cord-023208-w99gc5nx cord-275258-azpg5yrh cord-027316-echxuw74 cord-264746-gfn312aa cord-213136-euv6pqh5 cord-031957-df4luh5v cord-321715-bkfkmtld cord-193356-hqbstgg7 cord-252347-vnn4135b cord-311240-o0zyt2vb cord-018963-2lia97db cord-267500-x3u9i1vq cord-102766-n6mpdhyu 
cord-311839-61djk4bs cord-254942-g51mjj2b cord-302798-q0mbngqy cord-321150-ev6acl7b cord-321762-7kiahjyy cord-266794-oyppubq5 cord-300807-9u8idlon cord-325985-xfzhn1n1 cord-280881-5o38ihe0 cord-274056-9t3kneoo cord-301827-a7hnuxy5 cord-279528-41atidai cord-287658-c2lljdi7 cord-300149-djclli8n cord-268467-btfz6ye8 cord-287634-64zqe4cz cord-310734-6v7oru2l cord-304869-l6a68tqn cord-324216-ce3wa889 cord-291156-zxg3dsm3 cord-296691-cg463fbn cord-302161-ytr7ds8i cord-023209-un2ysc2v cord-325043-vqjhiv7p cord-004879-pgyzluwp cord-325750-x7jpsnxg cord-001835-0s7ok4uw cord-326225-crtpzad7 cord-328644-odtue60a cord-324021-y1vr1db0 cord-334394-qgyzk7th cord-331698-rwow1ydx cord-338207-60vrlrim cord-330067-ujhgb3b0 cord-334127-wjf8t8vp cord-348427-worgd0xu cord-339209-oe8onyr9 cord-345552-h6fwi0qn cord-340907-j9i1wlak cord-304607-td0776wj cord-341564-fvuwick5 cord-328259-3g4klpyg cord-354465-5nqrrnqr cord-342785-55r01n0x cord-330312-1pjolkql cord-339915-8j04y50s cord-355075-ieb35upi cord-344782-ond1ziu5 cord-353290-1wi1dhv6 cord-343863-q1y8uscj cord-341879-vubszdp2 Creating transaction Updating wrd table ===== Reducing urls cord-016798-tv2ntug6 cord-025948-6dsx7pey cord-000473-jpow6iw1 cord-015850-ef6svn8f cord-033010-o5kiadfm cord-256608-ajzk86rq cord-002473-2kpxhzbe cord-011565-8ncgldaq cord-001537-i34vmfpp cord-256278-jvfjf7aw cord-103297-4stnx8dw cord-000642-mkwpuav6 cord-016594-lj0us1dq cord-264135-s2u76pvk cord-264296-0x90yubt cord-266288-buc4dd5y cord-018133-2otxft31 cord-003316-r5te5xob cord-017932-vmtjc8ct cord-022494-d66rz6dc cord-255371-o9oxchq6 cord-001974-wjf3c7a7 cord-275258-azpg5yrh cord-306725-0vam15pt cord-193356-hqbstgg7 cord-264746-gfn312aa cord-267500-x3u9i1vq cord-311240-o0zyt2vb cord-311839-61djk4bs cord-018963-2lia97db cord-102766-n6mpdhyu cord-254942-g51mjj2b cord-321150-ev6acl7b cord-302798-q0mbngqy cord-266794-oyppubq5 cord-280881-5o38ihe0 cord-274056-9t3kneoo cord-325985-xfzhn1n1 cord-301827-a7hnuxy5 cord-300149-djclli8n cord-296691-cg463fbn 
cord-324216-ce3wa889 cord-302161-ytr7ds8i cord-291156-zxg3dsm3 cord-310734-6v7oru2l cord-304607-td0776wj cord-325750-x7jpsnxg cord-001835-0s7ok4uw cord-326225-crtpzad7 cord-328644-odtue60a cord-334394-qgyzk7th cord-330067-ujhgb3b0 cord-339209-oe8onyr9 cord-334127-wjf8t8vp cord-348427-worgd0xu cord-354465-5nqrrnqr cord-341564-fvuwick5 cord-328259-3g4klpyg cord-342785-55r01n0x cord-344782-ond1ziu5 cord-355075-ieb35upi cord-353290-1wi1dhv6 Creating transaction Updating url table ===== Reducing named entities cord-000257-ampip7od cord-016798-tv2ntug6 cord-000473-jpow6iw1 cord-016293-pyb00pt5 cord-025610-7vouj8pp cord-014674-ey29970v cord-025948-6dsx7pey cord-004862-yv76yvy5 cord-018459-isbc1r2o cord-015850-ef6svn8f cord-012975-u87ol3fs cord-033010-o5kiadfm cord-256608-ajzk86rq cord-103029-nc5yf6x4 cord-001340-kqcx7lrq cord-002473-2kpxhzbe cord-010260-8lnpujip cord-010161-bcuec2fz cord-017584-9rx4jlw8 cord-005060-n901y2d4 cord-001537-i34vmfpp cord-011565-8ncgldaq cord-103297-4stnx8dw cord-256278-jvfjf7aw cord-000642-mkwpuav6 cord-255194-4i9fc0r7 cord-023647-dlqs8ay9 cord-016594-lj0us1dq cord-022348-w7z97wir cord-264296-0x90yubt cord-264135-s2u76pvk cord-203232-1nnqx1g9 cord-266288-buc4dd5y cord-035033-osjy88rc cord-266960-kyx6xhvj cord-001786-ybd8hi8y cord-003316-r5te5xob cord-018133-2otxft31 cord-017932-vmtjc8ct cord-300796-rmjv56ia cord-265857-fs6dj3dp cord-010273-0c56x9f5 cord-010499-yefxrj30 cord-263987-ff6kor0c cord-022494-d66rz6dc cord-193910-7p3f3znj cord-253436-dz84icdc cord-255371-o9oxchq6 cord-017354-cndb031c cord-014461-2ubh9u8r cord-268549-2lg8i9r1 cord-275258-azpg5yrh cord-001974-wjf3c7a7 cord-027316-echxuw74 cord-014462-11ggaqf1 cord-321386-u1imic5l cord-306725-0vam15pt cord-213136-euv6pqh5 cord-252347-vnn4135b cord-264746-gfn312aa cord-193356-hqbstgg7 cord-267500-x3u9i1vq cord-311240-o0zyt2vb cord-031957-df4luh5v cord-321715-bkfkmtld cord-311839-61djk4bs cord-018963-2lia97db cord-321762-7kiahjyy cord-102766-n6mpdhyu cord-254942-g51mjj2b 
cord-321150-ev6acl7b cord-302798-q0mbngqy cord-266794-oyppubq5 cord-300807-9u8idlon cord-023208-w99gc5nx cord-280881-5o38ihe0 cord-274056-9t3kneoo cord-325985-xfzhn1n1 cord-279528-41atidai cord-300149-djclli8n cord-268467-btfz6ye8 cord-287658-c2lljdi7 cord-301827-a7hnuxy5 cord-304869-l6a68tqn cord-287634-64zqe4cz cord-324216-ce3wa889 cord-296691-cg463fbn cord-302161-ytr7ds8i cord-291156-zxg3dsm3 cord-304607-td0776wj cord-310734-6v7oru2l cord-325043-vqjhiv7p cord-325750-x7jpsnxg cord-324021-y1vr1db0 cord-326225-crtpzad7 cord-328644-odtue60a cord-334394-qgyzk7th cord-331698-rwow1ydx cord-338207-60vrlrim cord-330067-ujhgb3b0 cord-341564-fvuwick5 cord-345552-h6fwi0qn cord-334127-wjf8t8vp cord-348427-worgd0xu cord-340907-j9i1wlak cord-339209-oe8onyr9 cord-342785-55r01n0x cord-328259-3g4klpyg cord-354465-5nqrrnqr cord-344782-ond1ziu5 cord-330312-1pjolkql cord-339915-8j04y50s cord-355075-ieb35upi cord-343863-q1y8uscj cord-341879-vubszdp2 cord-353290-1wi1dhv6 cord-004879-pgyzluwp cord-023209-un2ysc2v cord-001835-0s7ok4uw Creating transaction Updating ent table ===== Reducing parts of speech cord-000257-ampip7od cord-025610-7vouj8pp cord-000473-jpow6iw1 cord-014674-ey29970v cord-016798-tv2ntug6 cord-018459-isbc1r2o cord-004862-yv76yvy5 cord-012975-u87ol3fs cord-025948-6dsx7pey cord-256608-ajzk86rq cord-015850-ef6svn8f cord-001340-kqcx7lrq cord-033010-o5kiadfm cord-002473-2kpxhzbe cord-103029-nc5yf6x4 cord-017584-9rx4jlw8 cord-010161-bcuec2fz cord-005060-n901y2d4 cord-001537-i34vmfpp cord-256278-jvfjf7aw cord-255194-4i9fc0r7 cord-023647-dlqs8ay9 cord-000642-mkwpuav6 cord-016293-pyb00pt5 cord-264296-0x90yubt cord-264135-s2u76pvk cord-203232-1nnqx1g9 cord-001786-ybd8hi8y cord-011565-8ncgldaq cord-266288-buc4dd5y cord-010260-8lnpujip cord-022348-w7z97wir cord-035033-osjy88rc cord-016594-lj0us1dq cord-018133-2otxft31 cord-265857-fs6dj3dp cord-266960-kyx6xhvj cord-300796-rmjv56ia cord-003316-r5te5xob cord-017932-vmtjc8ct cord-010273-0c56x9f5 cord-010499-yefxrj30 
cord-263987-ff6kor0c cord-193910-7p3f3znj cord-253436-dz84icdc cord-255371-o9oxchq6 cord-014461-2ubh9u8r cord-268549-2lg8i9r1 cord-275258-azpg5yrh cord-022494-d66rz6dc cord-001974-wjf3c7a7 cord-306725-0vam15pt cord-321386-u1imic5l cord-027316-echxuw74 cord-213136-euv6pqh5 cord-017354-cndb031c cord-103297-4stnx8dw cord-252347-vnn4135b cord-267500-x3u9i1vq cord-311240-o0zyt2vb cord-102766-n6mpdhyu cord-321150-ev6acl7b cord-266794-oyppubq5 cord-300807-9u8idlon cord-311839-61djk4bs cord-264746-gfn312aa cord-321715-bkfkmtld cord-254942-g51mjj2b cord-321762-7kiahjyy cord-302798-q0mbngqy cord-280881-5o38ihe0 cord-274056-9t3kneoo cord-325985-xfzhn1n1 cord-031957-df4luh5v cord-279528-41atidai cord-300149-djclli8n cord-268467-btfz6ye8 cord-287658-c2lljdi7 cord-287634-64zqe4cz cord-324216-ce3wa889 cord-296691-cg463fbn cord-291156-zxg3dsm3 cord-018963-2lia97db cord-304607-td0776wj cord-302161-ytr7ds8i cord-325043-vqjhiv7p cord-310734-6v7oru2l cord-014462-11ggaqf1 cord-325750-x7jpsnxg cord-328644-odtue60a cord-304869-l6a68tqn cord-301827-a7hnuxy5 cord-193356-hqbstgg7 cord-324021-y1vr1db0 cord-334394-qgyzk7th cord-326225-crtpzad7 cord-331698-rwow1ydx cord-338207-60vrlrim cord-330067-ujhgb3b0 cord-334127-wjf8t8vp cord-339209-oe8onyr9 cord-348427-worgd0xu cord-345552-h6fwi0qn cord-341564-fvuwick5 cord-340907-j9i1wlak cord-330312-1pjolkql cord-342785-55r01n0x cord-328259-3g4klpyg cord-339915-8j04y50s cord-344782-ond1ziu5 cord-343863-q1y8uscj cord-341879-vubszdp2 cord-355075-ieb35upi cord-353290-1wi1dhv6 cord-354465-5nqrrnqr cord-023208-w99gc5nx cord-004879-pgyzluwp cord-023209-un2ysc2v cord-001835-0s7ok4uw Creating transaction Updating pos table Building ./etc/reader.txt cord-001835-0s7ok4uw cord-301827-a7hnuxy5 cord-023209-un2ysc2v cord-023209-un2ysc2v cord-023208-w99gc5nx cord-001835-0s7ok4uw number of items: 118 sum of words: 1,037,270 average size in words: 9,973 average readability score: 51 nouns: sequence; sequences; protein; proteins; virus; structure; data; analysis; 
genome; peptides; dna; gene; peptide; number; acid; cell; viruses; amino; cells; results; methods; activity; method; genes; model; alignment; structures; information; time; study; species; sequencing; studies; residues; region; acids; database; approach; function; type; genomes; domain; similarity; disease; samples; receptor; length; group; order; expression verbs: used; shown; based; bind; found; contain; identifying; including; provided; known; obtaining; represented; determined; compare; suggests; given; generating; developed; indicate; increasing; following; performed; describe; see; predict; allowing; involved; make; revealed; leads; associated; form; studying; considered; observe; reported; detecting; result; produce; required; propose; related; expressed; induced; characterize; isolated; cause; investigated; defined; applied adjectives: different; viral; new; human; high; specific; molecular; structural; many; large; similar; important; several; first; biological; single; multiple; novel; non; immune; small; functional; possible; available; nucleotide; various; present; low; genetic; genomic; phylogenetic; common; secondary; positive; active; major; complete; higher; like; particular; short; potential; unique; evolutionary; dependent; clinical; long; free; amino; natural adverbs: also; however; well; highly; therefore; respectively; previously; even; recently; often; furthermore; first; now; currently; still; together; directly; far; rather; finally; much; significantly; specifically; moreover; closely; relatively; less; especially; generally; clearly; widely; usually; approximately; already; almost; yet; subsequently; randomly; hence; completely; fully; additionally; instead; interestingly; strongly; rapidly; potentially; particularly; typically; successfully pronouns: we; it; their; its; our; they; i; them; us; his; one; he; itself; themselves; your; my; you; her; him; she; me; ourselves; yÞ; mine; l1oc; himself; s; ppifs; p53-mdm2; p450s; n40np; 
ifnyr-/-mice; https://github.com/ababaian/serratus; em; cb562; ³hser; yegfp; y_~; y401; y; w@; u; tlg1; sod-3::gfp; pgem2dhfr; p110a; ours; organotyp[c; n−3; nthash proper nouns: RNA; C; Fig; SARS; PCR; Table; A; N; DNA; T; S; Genome; University; NMR; II; DeepRC; M; ±; NCBI; Protein; CoV-2; HCV; B; L; fl; HIV; K; E.; D; GenBank; Virus; India; Human; bp; LSTM; Institute; F; CNN; RT; China; MS; Gly; G; C.; novo; mRNA; Hopfield; L1; Analysis; CoV keywords: sequence; rna; dna; protein; virus; genome; structure; sars; gene; model; pcr; viral; study; human; cell; acid; university; table; result; peptide; nmr; high; disease; bind; activity; sequencing; receptor; plant; ncbi; method; interaction; india; cnn; cmv; vaccine; tyr; site; residue; read; probe; pro; phylogenetic; orf; mutation; mil; mhc; metagenomic; lys; lstm; isolate one topic; one dimension: sequence file(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945003/ titles(s): The Nature of Protein Domain Evolution: Shaping the Interaction Network three topics; one dimension: sequence; protein; virus file(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7261164/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639731/ titles(s): To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics | Poster Presentations | Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. 
University, Tirupati, Andhra Pradesh five topics; three dimensions: sequence sequences virus; protein proteins binding; peptide peptides activity; sequence sequences protein; structures secondary rna file(s): https://doi.org/10.3390/v12040422, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087532/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123984/, https://www.ncbi.nlm.nih.gov/pubmed/17883226/ titles(s): A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses | Programmed cell death | Poster Presentations | Protein Structure Prediction by Protein Threading | RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties Type: cord title: keyword-sequence-cord date: 2021-05-25 time: 16:43 username: emorgan patron: Eric Morgan email: emorgan@nd.edu input: keywords:sequence ==== make-pages.sh htm files ==== make-pages.sh complex files ==== make-pages.sh named enities ==== making bibliographics id: cord-274056-9t3kneoo author: Abd Elwahaab, Marwa A. title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector date: 2019-05-08 words: 3314.0 sentences: 251.0 pages: flesch: 59.0 cache: ./cache/cord-274056-9t3kneoo.txt txt: ./txt/cord-274056-9t3kneoo.txt summary: title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1 . The similarity/dissimilarity vectors that are corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. 
The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2 . The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4 . The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike ) is illustrated in Table 14 and shown in Figure 6 . abstract: Similarity/dissimilarity analysis is a key way of understanding the biology of an organism by knowing the origin of the new genes/sequences. Sequence data are grouped in terms of biological relationships. The number of sequences related to any group is susceptible to be increased every day. All the present alignment-free methods approve the utility of their approaches by producing a similarity/dissimilarity matrix. Although this matrix is clear, it measures the degree of similarity among sequences individually. In our work, a representative of each of three groups of protein sequences is introduced. A similarity/dissimilarity vector is evaluated instead of the ordinary similarity/dissimilarity matrix based on the group representative. The approach is applied on three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-grouping comparison is produced to ensure the singularity of each group. A qualitative comparison between our approach, previous articles, and the phylogenetic tree of these protein sequences proved the utility of our approach. url: https://doi.org/10.1155/2019/8702968 doi: 10.1155/2019/8702968 id: cord-279528-41atidai author: Abo-Elkhier, Mervat M. 
title: Measuring Similarity among Protein Sequences Using a New Descriptor date: 2019-11-22 words: 3045.0 sentences: 217.0 pages: flesch: 57.0 cache: ./cache/cord-279528-41atidai.txt txt: ./txt/cord-279528-41atidai.txt summary: Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A_t, SA_t). The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in [...]. The 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b), respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A_t and standard deviation SA_t of the combined intensity level value A_t(i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids abstract: The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With the development of sequencing technologies, the number of protein sequences in the public databases is increasing exponentially. Well-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences with little computation time.
It is applied on nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with others' approaches, results, and sequence homology. url: https://www.ncbi.nlm.nih.gov/pubmed/31886192/ doi: 10.1155/2019/2796971 id: cord-287634-64zqe4cz author: Al-Ssulami, Abdulrakeeb M. title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents date: 2020-01-31 words: 2307.0 sentences: 137.0 pages: flesch: 59.0 cache: ./cache/cord-287634-64zqe4cz.txt txt: ./txt/cord-287634-64zqe4cz.txt summary: For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents abstract: Abstract Identification of regulatory elements is essential for understanding the mechanism behind regulating gene expression. 
These regulatory elements, located in or near genes, bind to proteins called transcription factors to initiate the transcription process. Their occurrences are influenced by the GC-content or nucleotide composition. For generating synthetic coding sequences with a pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. Both methods rely on the probability of choosing a synonymous codon for a specific amino acid. Although the latter behaves in an unbiased manner, the sequences it produces do not exactly obey the GC-content constraint. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. The proposed tool, namely CodSeqGen, depends on random selection for smaller subsets to be traversed using the backtracking approach. url: https://doi.org/10.1016/j.ygeno.2019.02.002 doi: 10.1016/j.ygeno.2019.02.002 id: cord-102766-n6mpdhyu author: Alam, Md. Nafis Ul title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date: 2020-06-25 words: 3193.0 sentences: 192.0 pages: flesch: 56.0 cache: ./cache/cord-102766-n6mpdhyu.txt txt: ./txt/cord-102766-n6mpdhyu.txt summary: title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers.
VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. abstract: High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task but these models rely entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de-novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data.
The outstanding performance metrics of our model lays the groundwork for robust machine learning methods for the automated annotation of sequence data. Author Summary In this age of high-throughput sequencing, proper classification of copious amounts of sequence data remains to be a daunting challenge. Presently, sequence alignment methods are immediately assigned to the task. Owing to the selection forces of nature, there is considerable homology even between the sequences of different species which draws ambiguity to the results of alignment-based searches. Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. These features were able to accurately distinguish virus RNA from human transcripts with performance scores better than all previous reports. Our models were able to generalize well to distant species of viruses and mouse transcripts. The model correctly classifies the majority of false hits generated by current standard alignment tools. These findings strongly imply that this k-mer score based computational pipeline forges a highly informative, rich set of numerical machine learning features and similar pipelines can greatly advance the field of computational biology. url: https://doi.org/10.1101/2020.06.25.170779 doi: 10.1101/2020.06.25.170779 id: cord-018133-2otxft31 author: Altman, Russ B. 
title: Bioinformatics date: 2006 words: 9592.0 sentences: 462.0 pages: flesch: 46.0 cache: ./cache/cord-018133-2otxft31.txt txt: ./txt/cord-018133-2otxft31.txt summary: Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biological Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantages of either explicit or computed links between these data resources. abstract: Why is sequence, structure, and biological pathway information relevant to medicine? Where on the Internet should you look for a DNA sequence, a protein sequence, or a protein structure? What are two problems encountered in analyzing biological sequence, structure, and function? How has the age of genomics changed the landscape of bioinformatics? What two changes should we anticipate in the medical record as a result of these new information sources? What are two computational challenges in bioinformatics for the future? url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122933/ doi: 10.1007/0-387-36278-9_22 id: cord-010260-8lnpujip author: Anthonsen, Henrik W. 
title: The blind watchmaker and rational protein engineering date: 1994-08-31 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: In the present review some scientific areas of key importance for protein engineering are discussed, such as problems involved in deducting protein sequence from DNA sequence (due to posttranscriptional editing, splicing and posttranslational modifications), modelling of protein structures by homology, NMR of large proteins (including probing the molecular surface with relaxation agents), simulation of protein structures by molecular dynamics and simulation of electrostatic effects in proteins (including pH-dependent effects). It is argued that all of these areas could be of key importance in most protein engineering projects, because they give access to increased and often unique information. In the last part of the review some potential areas for future applications of protein engineering approaches are discussed, such as non-conventional media, de novo design and nanotechnology. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173218/ doi: 10.1016/0168-1656(94)90152-x id: cord-000473-jpow6iw1 author: Astrovskaya, Irina title: Inferring viral quasispecies spectra from 454 pyrosequencing reads date: 2011-07-28 words: 5363.0 sentences: 296.0 pages: flesch: 54.0 cache: ./cache/cord-000473-jpow6iw1.txt txt: ./txt/cord-000473-jpow6iw1.txt summary: High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. 
Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population. abstract: BACKGROUND: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. RESULTS: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads.
Indeed, 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html. CONCLUSIONS: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194189/ doi: 10.1186/1471-2105-12-s6-s1 id: cord-035033-osjy88rc author: Aydin, Berkay title: Spatiotemporal event sequence discovery without thresholds date: 2020-11-09 words: 8231.0 sentences: 430.0 pages: flesch: 54.0 cache: ./cache/cord-035033-osjy88rc.txt txt: ./txt/cord-035033-osjy88rc.txt summary: Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. 
In this paper, we focus on mining STESs using a randomization approach, which takes a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials. abstract: Spatiotemporal event sequences (STESs) are the ordered series of event types whose instances frequently follow each other in time and are located close-by. An STES is a spatiotemporal frequent pattern type, which is discovered from moving region objects whose polygon-based locations continuously evolve over time. Previous studies on STES mining require significance and prevalence thresholds for the discovery, which are usually unknown to domain experts. The quality of the discovered sequences is of great importance to the domain experts who use these algorithms. We introduce a novel algorithm to find the most relevant STESs without threshold values. We tested the relevance and performance of our threshold-free algorithm with a case study on solar event metadata, and compared the results with the previous STES mining algorithms. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7649715/ doi: 10.1007/s10707-020-00427-6 id: cord-000257-ampip7od author: Bagowski, Christoph P title: The Nature of Protein Domain Evolution: Shaping the Interaction Network date: 2010-08-17 words: 4678.0 sentences: 249.0 pages: flesch: 43.0 cache: ./cache/cord-000257-ampip7od.txt txt: ./txt/cord-000257-ampip7od.txt summary: With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks.
This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig. abstract: The proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. These protein domains are essentially the main components of globular proteins and are the most principal level at which protein function and protein interactions can be understood. An important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. Changes in this information may bring about new folds, functions and protein architectures. With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. Such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving inter- and intra-molecular interactions. In turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. Additionally, these studies can be used for the design of new and optimized protein domains for therapy. 
In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945003/ doi: 10.2174/138920210791616725 id: cord-003316-r5te5xob author: Balloux, Francois title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic date: 2018-12-17 words: 7340.0 sentences: 327.0 pages: flesch: 34.0 cache: ./cache/cord-003316-r5te5xob.txt txt: ./txt/cord-003316-r5te5xob.txt summary: WGS-based strain identification gives a far superior resolution In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. abstract: Hospitals worldwide are facing an increasing incidence of hard-to-treat infections. Limiting infections and providing patients with optimal drug regimens require timely strain identification as well as virulence and drug-resistance profiling. 
Additionally, prophylactic interventions based on the identification of environmental sources of recurrent infections (e.g., contaminated sinks) and reconstruction of transmission chains (i.e., who infected whom) could help to reduce the incidence of nosocomial infections. WGS could hold the key to solving these issues. However, uptake in the clinic has been slow. Some major scientific and logistical challenges need to be solved before WGS fulfils its potential in clinical microbial diagnostics. In this review we identify major bottlenecks that need to be resolved for WGS to routinely inform clinical intervention and discuss possible solutions. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6249990/ doi: 10.1016/j.tim.2018.08.004 id: cord-291156-zxg3dsm3 author: Bernasconi, Anna title: Empowering Virus Sequences Research through Conceptual Modeling date: 2020-05-01 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: The pandemic outbreak of the coronavirus disease has attracted attention towards the genetic mechanisms of viruses. We hereby present the Viral Conceptual Model (VCM), centered on the virus sequence and described from four perspectives: biological (virus type and hosts/sample), analytical (annotations and variants), organizational (sequencing project) and technical (experimental technology). VCM is inspired by GCM, our previously developed Genomic Conceptual Model, but it introduces many novel concepts, as viral sequences significantly differ from human genomes. When applied to SARS-CoV2 virus, complex conceptual queries upon VCM are able to replicate the search results of recent articles, hence demonstrating huge potential in supporting virology research. In addition to VCM, we also illustrate the data dictionary for patient’s phenotype used by the COVID-19 Host Genetic Initiative. 
Our effort is part of a broad vision: availability of conceptual models for both human genomics and viruses will provide important opportunities for research, especially if interconnected by the same human being, playing the role of virus host as well as provider of genomic and phenotype information. url: https://doi.org/10.1101/2020.04.29.067637 doi: 10.1101/2020.04.29.067637 id: cord-304869-l6a68tqn author: Bielińska-Wąż, Dorota title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity date: 2011-08-28 words: 15408.0 sentences: 940.0 pages: flesch: 60.0 cache: ./cache/cord-304869-l6a68tqn.txt txt: ./txt/cord-304869-l6a68tqn.txt summary: As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences, so it can be used for single nucleotide polymorphism (SNP) analyses, which are the subject of many investigations, as for example in a recent work by Bhasi et al. abstract: New approaches aiming at a detailed similarity/dissimilarity analysis of DNA sequences are formulated.
Several corrections that enrich the information which may be derived from the alignment methods are proposed. The corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. The studies are supplemented by detailed similarity studies for histones H1 and H4 coding sequences. The data are described according to the latest version of the EMBL database. The work is supplemented by a concise review of the state-of-art graphical representations of DNA sequences. url: https://www.ncbi.nlm.nih.gov/pubmed/32214591/ doi: 10.1007/s10910-011-9890-8 id: cord-310734-6v7oru2l author: Bolatti, Elisa M. title: A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses date: 2020-04-09 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Bats provide important ecosystem services as pollinators, seed dispersers, and/or insect controllers, but they have also been found harboring different viruses with zoonotic potential. Virome studies in bats distributed in Asia, Africa, Europe, and North America have increased dramatically over the past decade, whereas information on viruses infecting South American species is scarce. 
We explored the virome of Tadarida brasiliensis, an insectivorous New World bat species inhabiting a maternity colony in Rosario (Argentina), by a metagenomic approach. The analysis of five pooled oral/anal swab samples indicated the presence of 43 different taxonomic viral families infecting a wide range of hosts. By conventional nucleic acid detection techniques and/or bioinformatics approaches, the genomes of two novel viruses were completely covered clustering into the Papillomaviridae (Tadarida brasiliensis papillomavirus type 1, TbraPV1) and Genomoviridae (Tadarida brasiliensis gemykibivirus 1, TbGkyV1) families. TbraPV1 is the first papillomavirus type identified in this host and the prototype of a novel genus. TbGkyV1 is the first genomovirus reported in New World bats and constitutes a new species within the genus Gemykibivirus. Our findings extend the knowledge about oral/anal viromes of a South American bat species and contribute to understand the evolution and genetic diversity of the novel characterized viruses. url: https://doi.org/10.3390/v12040422 doi: 10.3390/v12040422 id: cord-334127-wjf8t8vp author: Brister, J. Rodney title: NCBI Viral Genomes Resource date: 2015-01-28 words: 3863.0 sentences: 186.0 pages: flesch: 37.0 cache: ./cache/cord-334127-wjf8t8vp.txt txt: ./txt/cord-334127-wjf8t8vp.txt summary: This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11) (12) (13) , sequence assembly and virus discovery (14) (15) (16) (17) , viral dynamics and evolution (18) (19) (20) and pathogen detection (14, (21) (22) (23) . 
The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated ''viral host'' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51) . abstract: Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. 
url: https://www.ncbi.nlm.nih.gov/pubmed/25428358/ doi: 10.1093/nar/gku1207 id: cord-203232-1nnqx1g9 author: Canturk, Semih title: Machine-Learning Driven Drug Repurposing for COVID-19 date: 2020-06-25 words: 5023.0 sentences: 257.0 pages: flesch: 52.0 cache: ./cache/cord-203232-1nnqx1g9.txt txt: ./txt/cord-203232-1nnqx1g9.txt summary: Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence. abstract: The integration of machine learning methods into bioinformatics provides particular benefits in identifying how therapeutics effective in one context might have utility in an unknown clinical context or against a novel pathology. We aim to discover the underlying associations between viral proteins and antiviral therapeutics that are effective against them by employing neural network models. 
Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Model training excluded SARS-CoV-2 proteins and included only Phases II, III, IV and Approved level drugs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. Our results suggest multiple drug candidates, some of which complement recent findings from noteworthy clinical studies. Our in-silico approach to drug repurposing has promise in identifying new drug candidates and treatments for other viruses. url: https://arxiv.org/pdf/2006.14707v1.pdf doi: nan id: cord-328644-odtue60a author: Comandatore, Francesco title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes date: 2020-05-28 words: 6535.0 sentences: 301.0 pages: flesch: 50.0 cache: ./cache/cord-328644-odtue60a.txt txt: ./txt/cord-328644-odtue60a.txt summary: These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses -as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016) , which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains respect to strains circulating globally. 
Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April 10, 2020 (Figure 1). abstract: The SARS-CoV-2 pandemic that we are currently experiencing is exerting a massive toll both in human lives and economic impact. One of the challenges we must face is to try to understand if and how different variants of the virus emerge and change their frequency in time. Such information can be extremely valuable as it may indicate shifts in aggressiveness, and it could provide useful information to trace the spread of the virus in the population. In this work we identified and traced over time 7 amino acid variants that are present with high frequency in Italy and Europe, but that were absent or present at very low frequencies during the first stages of the epidemic in China and the initial reports in Europe. The analysis of these variants helps define 6 phylogenetic clades that are currently spreading throughout the world with changes in frequency that are sometimes very fast and dramatic. In the absence of conclusive data at the time of writing, we discuss whether the spread of the variants may be due to a prominent founder effect or if it indicates an adaptive advantage. url: https://doi.org/10.1101/2020.04.30.071027 doi: 10.1101/2020.04.30.071027 id: cord-268549-2lg8i9r1 author: Dai, Qi title: Sequence comparison via polar coordinates representation and curve tree date: 2012-01-07 words: 4360.0 sentences: 272.0 pages: flesch: 59.0 cache: ./cache/cord-268549-2lg8i9r1.txt txt: ./txt/cord-268549-2lg8i9r1.txt summary: It considers the whole distribution of dual bases and employs a polar coordinates method to map a biological sequence into a closed curve.
First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation abstract: Abstract Sequence comparison has become one of the essential bioinformatics tools in bioinformatics research, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but there are some design challenges with the graphical representations and feature-based measures. We reported here a new method for sequence comparison. It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. The curve tree was then constructed to numerically characterize the closed curve of biological sequences, and further compared biological sequences by evaluating the distance of the curve tree of the query sequence matching against a corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was further compared with alignment-based methods. 
The results demonstrate that using polar coordinates representation and curve tree to compare sequences is more efficient. url: https://doi.org/10.1016/j.jtbi.2011.09.030 doi: 10.1016/j.jtbi.2011.09.030 id: cord-002473-2kpxhzbe author: Das, Jayanta Kumar title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach date: 2017-03-31 words: 4613.0 sentences: 285.0 pages: flesch: 61.0 cache: ./cache/cord-002473-2kpxhzbe.txt txt: ./txt/cord-002473-2kpxhzbe.txt summary: Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups. abstract: Periplasmic c7 type cytochrome A (PpcA) protein is determined in Geobacter sulfurreducens along with its other four homologs (PpcB-E). From the crystal structure viewpoint the observation emerges that PpcA protein can bind with Deoxycholate (DXCA), while its other homologs do not. But it is yet to be established with certainty the reason behind this from primary protein sequence information. This study is primarily based on primary protein sequence analysis through the chemical basis of embedded amino acids. Firstly, we look for the chemical group specific score of amino acids. 
Along with this, we have developed a new methodology for phylogenetic analysis based on chemical group dissimilarities of amino acids. This new methodology is applied to the cytochrome c7 family members and pinpoints how a particular sequence differs from the others. Secondly, we build a graph theoretic model using amino acid sequences, which is also applied to the cytochrome c7 family members, and some unique characteristics and their domains are highlighted. Thirdly, we search for unique patterns as subsequences which are common among the group or specific to an individual member. In all the cases, we are able to show some distinct features of PpcA that set it apart from its other homologs and are associated with its binding to deoxycholate. Similarly, some notable features for the structurally dissimilar protein PpcD compared to the other homologs are also brought out. Further, the five members of the cytochrome family being homolog proteins, they must have some common significant features, which are also enumerated in this study. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5376323/ doi: 10.1371/journal.pone.0175031 id: cord-004862-yv76yvy5 author: Demers, G. William title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date: 1989 words: 6659.0 sentences: 347.0 pages: flesch: 62.0 cache: ./cache/cord-004862-yv76yvy5.txt txt: ./txt/cord-004862-yv76yvy5.txt summary: The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome.
However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5′ half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5′ and 3′ ends are not conserved between species. abstract: The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. L1Oc repeats have a common 3′ end that terminates in a poly A addition signal and an A-rich tract, but individual repeats have different 5′ ends, indicating a polar truncation from the 5′ end during their synthesis or propagation. As a result of the polar truncations, the 5′ end of L1Oc is present in about 11,000 copies per haploid genome, whereas the 3′ end is present in at least 66,000 copies per haploid genome. One type of L1Oc repeat has internal direct repeats of 78 bp in the 3′ untranslated region, whereas other L1Oc repeats have only one copy of this sequence. The longest repeat sequenced, L1Oc5, is 6.5 kb long, and genomic blot-hybridization data using probes from the 5′ end of L1Oc5 indicate that a full length L1Oc repeat is about 7.5 kb long, extending about 1 kb 5′ to the sequenced region. The L1Oc5 sequence has long open reading frames (ORFs) that correspond to ORF-1 and ORF-2 described in the mouse L1 sequence. In contrast to the overlapping reading frames seen for mouse L1, ORF-1 and ORF-2 are in the same reading frame in rabbit and human L1s, resulting in a dicistronic structure.
The region between the likely stop codon for ORF-1 and the proposed start codon for ORF-2 is not conserved in interspecies comparisons, which is further evidence that this short region does not encode part of a protein. ORF-1 appears to be a hybrid of sequences, of which the 3′ half is unique to and conserved in mammalian L1 repeats. The 5′ half of ORF-1 is not conserved between mammalian L1 repeats, but this segment of L1Oc is related significantly to type II cytoskeletal keratin. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087506/ doi: 10.1007/bf02106177 id: cord-339915-8j04y50s author: Deng, Wei title: DV-Curve Representation of Protein Sequences and Its Application date: 2014-05-08 words: 2946.0 sentences: 176.0 pages: flesch: 49.0 cache: ./cache/cord-339915-8j04y50s.txt txt: ./txt/cord-339915-8j04y50s.txt summary: Based on the detailed hydrophobic-hydrophilic(HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures the other is the phylogenetic analysis among coronaviruses based on their spike proteins. In this paper, we introduce DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. 
Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation Similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation abstract: Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. This graphical representation not only avoids degeneracy, but also has good visualization no matter how long these sequences are, and can reflect the length of protein sequence. Then we transform the 2D-graphical representation into a numerical characterization that can facilitate quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is the phylogenetic analysis among coronaviruses based on their spike proteins. url: https://doi.org/10.1155/2014/203871 doi: 10.1155/2014/203871 id: cord-255194-4i9fc0r7 author: Djikeng, Appolinaire title: Viral genome sequencing by random priming methods date: 2008-01-07 words: 3776.0 sentences: 207.0 pages: flesch: 51.0 cache: ./cache/cord-255194-4i9fc0r7.txt txt: ./txt/cord-255194-4i9fc0r7.txt summary: An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3′ end (Figure 2).
Additionally, in order to capture 5′ ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5′ end was added to the Klenow reaction (Figure 2 shows a 5′ oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10^6 viral particles per RT-PCR reaction (Figure 5). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives. abstract: BACKGROUND: Most emerging health threats are of zoonotic origin. For the overwhelming majority, their causative agents are RNA viruses which include but are not limited to HIV, Influenza, SARS, Ebola, Dengue, and Hantavirus. Of increasing importance therefore is a better understanding of global viral diversity to enable better surveillance and prediction of pandemic threats; this will require rapid and flexible methods for complete viral genome sequencing. RESULTS: We have adapted the SISPA methodology [1-3] to genome sequencing of RNA and DNA viruses. We have demonstrated the utility of the method on various types and sources of viruses, obtaining near complete genome sequence of viruses ranging in size from 3,000–15,000 kb with a median depth of coverage of 14.33. We used this technique to generate full viral genome sequence in the presence of host contaminants, using viral preparations from cell culture supernatant, allantoic fluid and fecal matter. CONCLUSION: The method described is of great utility in generating whole genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections.
url: https://doi.org/10.1186/1471-2164-9-5 doi: 10.1186/1471-2164-9-5 id: cord-266288-buc4dd5y author: Dong, Rui title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance date: 2019-04-09 words: 5247.0 sentences: 300.0 pages: flesch: 61.0 cache: ./cache/cord-266288-buc4dd5y.txt txt: ./txt/cord-266288-buc4dd5y.txt summary: Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016) , however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in R 18 , where the additional six dimensions describe the covariance between nucleotides. abstract: Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. The alignment-free methods are popular nowadays, whereas the manual intervention in those methods usually decreases the accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). 
By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only can capture the distribution of each nucleotide, but also provide the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ(18). The tests of ANV of datasets of different sizes and types have proved the accuracy and time-efficiency of the new proposed ANV method. url: https://www.ncbi.nlm.nih.gov/pubmed/31024610/ doi: 10.3389/fgene.2019.00234 id: cord-033010-o5kiadfm author: Durojaye, Olanrewaju Ayodeji title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date: 2020-10-02 words: 8125.0 sentences: 375.0 pages: flesch: 53.0 cache: ./cache/cord-033010-o5kiadfm.txt txt: ./txt/cord-033010-o5kiadfm.txt summary: RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively, occupying 45 and 47% of the total residues with the percentage loop occupancy at 8% regarded as comparative modeling, constructs atomic models based on known structures or structures that have been determined experimentally and likewise share more than 40% sequence homology. abstract: BACKGROUND: The 2019-nCoV which is regarded as a novel coronavirus is a positive-sense single-stranded RNA virus. 
It is infectious to humans and is the cause of the ongoing coronavirus outbreak, which has elicited a public health emergency and a call for immediate international concern. The coronavirus main proteinase, which is also known as the 3C-like protease (3CLpro), is a very important protein in all coronaviruses for the role it plays in the replication of the virus and the proteolytic processing of the viral polyproteins. The resultant cytotoxic effect, which is a product of consistent viral replication and proteolytic processing of polyproteins, can be greatly reduced through the inhibition of the viral main proteinase activities. This makes the 3C-like protease of the coronavirus a potential and promising target for therapeutic agents against the viral infection. RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Comparative physicochemical studies were carried out on the resultant target protein and its template while selected HIV protease inhibitors were docked against the protein binding sites which contained no co-crystallized ligand. CONCLUSION: In line with results from this study which has shown great consistency with other scientific findings on coronaviruses, we recommend the administration of the selected HIV protease inhibitors as first-line therapeutic agents for the treatment of the current coronavirus epidemic.
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7529470/ doi: 10.1186/s43042-020-00081-5 id: cord-001786-ybd8hi8y author: Dutilh, Bas E title: Metagenomic ventures into outer sequence space date: 2014-12-15 words: 2193.0 sentences: 121.0 pages: flesch: 44.0 cache: ./cache/cord-001786-ybd8hi8y.txt txt: ./txt/cord-001786-ybd8hi8y.txt summary: These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. abstract: Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs.
This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4588555/ doi: 10.4161/21597081.2014.979664 id: cord-334394-qgyzk7th author: Edgar, Robert C. title: Petabase-scale sequence alignment catalyses viral discovery date: 2020-08-10 words: 8134.0 sentences: 423.0 pages: flesch: 51.0 cache: ./cache/cord-334394-qgyzk7th.txt txt: ./txt/cord-334394-qgyzk7th.txt summary: To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] . 
abstract: Public sequence data represents a major opportunity for viral discovery, but its exploration has been inhibited by a lack of efficient methods for searching this corpus, which is currently at the petabase scale and growing exponentially. To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To implement this strategy, we developed a cloud computing architecture, Serratus, tailored for ultra-high throughput sequence alignment at the petabase scale. From this search, we identified and assembled thousands of CoV and CoV-like genomes and genome fragments ranging from known strains to putatively novel genera. We generalise this strategy to other viral families, identifying several novel deltaviruses and huge bacteriophages. To catalyse a new era of viral discovery we made millions of viral alignments and family identifications freely available to the research community. Expanding the known diversity and zoonotic reservoirs of CoV and other emerging pathogens can accelerate vaccine and therapeutic developments for the current pandemic, and help us anticipate and mitigate future ones. 
url: https://doi.org/10.1101/2020.08.07.241729 doi: 10.1101/2020.08.07.241729 id: cord-011565-8ncgldaq author: Elworth, R A Leo title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: 2020-06-04 words: 12960.0 sentences: 717.0 pages: flesch: 53.0 cache: ./cache/cord-011565-8ncgldaq.txt txt: ./txt/cord-011565-8ncgldaq.txt summary: For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7), Count-Min Sketch (CMS) (8), HyperLogLog (9) and Bloom filters (10). A more in depth discussion of many of these topics can also be found in (3, 4) includes a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13), with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. abstract: As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. 
These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7261164/ doi: 10.1093/nar/gkaa265 id: cord-256278-jvfjf7aw author: Feng, Jie title: New method for comparing DNA primary sequences based on a discrimination measure date: 2010-10-21 words: 2864.0 sentences: 186.0 pages: flesch: 53.0 cache: ./cache/cord-256278-jvfjf7aw.txt txt: ./txt/cord-256278-jvfjf7aw.txt summary: Three years after, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2, we present the similarity/dissimilarity matrix for the full DNA sequences of β-globin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 β-globin gene sequences based on the distance matrix DM, using NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words abstract: We introduce a new approach to compare DNA primary sequences. 
The core of our method is a new measure of pairwise distances among sequences. Using the primitive discrimination substrings of sequences S and Q, a discrimination measure DM(S, Q) is defined for their similarity analysis. The proposed method does not require multiple alignments and is fully automatic. To illustrate its utility, we construct phylogenetic trees on two independent data sets. The results indicate that the method is efficient and powerful. url: https://www.sciencedirect.com/science/article/pii/S0022519310003978 doi: 10.1016/j.jtbi.2010.07.040 id: cord-016594-lj0us1dq author: Flower, Darren R. title: Identification of Candidate Vaccine Antigens In Silico date: 2012-09-28 words: 12570.0 sentences: 653.0 pages: flesch: 37.0 cache: ./cache/cord-016594-lj0us1dq.txt txt: ./txt/cord-016594-lj0us1dq.txt summary: In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequenced, progresses through an extensive computational stage, ultimately to deliver a shortlist of antigens which can be validated through subsequent laboratory examination. 
Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins, analysis in a series of in-vitro and in-vivo assays, animal models and with the ultimate objective of isolating one or two proteins displaying protective immunity. abstract: The identification of immunogenic whole-protein antigens is fundamental to the successful discovery of candidate subunit vaccines and their rapid, effective, and efficient transformation into clinically useful, commercially successful vaccine formulations. In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. Reference is also made to the recent emergence of various expert systems for protein antigen identification. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7120937/ doi: 10.1007/978-1-4614-5070-2_3 id: cord-001974-wjf3c7a7 author: Friis-Nielsen, Jens title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date: 2016-02-19 words: 5773.0 sentences: 348.0 pages: flesch: 48.0 cache: ./cache/cord-001974-wjf3c7a7.txt txt: ./txt/cord-001974-wjf3c7a7.txt summary: Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. 
The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP. abstract: Virus discovery from high throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association to disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified. 
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4776208/ doi: 10.3390/v8020053 id: cord-016798-tv2ntug6 author: Gautam, Ablesh title: Bioinformatics Applications in Advancing Animal Virus Research date: 2019-06-06 words: 6978.0 sentences: 405.0 pages: flesch: 44.0 cache: ./cache/cord-016798-tv2ntug6.txt txt: ./txt/cord-016798-tv2ntug6.txt summary: The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable analysis of host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that differ from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al. abstract: Viruses serve as infectious agents for all living entities. There have been various research groups that focus on understanding the viruses in terms of their host-viral relationships, pathogenesis and immune evasion. However, with the current advances in the field of science, the research field has now widened to the ‘omics’ level. Apparently, generation of viral sequence data has been increasing. 
There are numerous bioinformatics tools available that not only aid in analysing such sequence data but also aid in deducing useful information that can be exploited in developing preventive and therapeutic measures. This chapter elaborates on bioinformatics tools that are specifically designed for animal viruses as well as other generic tools that can be exploited to study animal viruses. The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable analysis of host-viral interactions, gene prediction in the viral genome, etc. Various databases that organize information on animal and human viruses have also been described. The chapter gives an overview of the current advances, online and downloadable tools and databases in the field of bioinformatics that will enable researchers to study animal viruses at the gene level. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121192/ doi: 10.1007/978-981-13-9073-9_23 id: cord-302798-q0mbngqy author: Ge, Junwei title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China date: 2018-06-14 words: 4343.0 sentences: 273.0 pages: flesch: 58.0 cache: ./cache/cord-302798-q0mbngqy.txt txt: ./txt/cord-302798-q0mbngqy.txt summary: In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. 
MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27] . abstract: Mink circovirus (MiCV), a virus that was newly discovered in 2013, has been associated with enteric disease. However, its etiological role in acute gastroenteritis is unclear, and its genetic characteristics are poorly described. In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. Detection results demonstrated that MiCV was the only pathogen found in this infection. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. 
Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). The amino acid sequence identity levels of Rep shared by MiCV with BatCV 1 (79.7%) and dog CV (dogCV) (54.5%) were broadly similar to those with starling CV (51.1%) and PCVs (46.5%). Phylogenetic analysis indicated that MiCVs were more closely related to mammalian CVs, such as BatCV, PCV, and dogCV, than to other animal CVs. Among mammalian CVs, MiCV and BatCV 1 were the most closely related. This study could contribute to understanding the potential pathogenicity of MiCV and the evolutionary and pathogenic characteristics of mammalian CVs. url: https://www.ncbi.nlm.nih.gov/pubmed/29948383/ doi: 10.1007/s00705-018-3908-5 id: cord-017932-vmtjc8ct author: Georgiev, Vassil St. title: Genomic and Postgenomic Research date: 2009 words: 8476.0 sentences: 360.0 pages: flesch: 36.0 cache: ./cache/cord-017932-vmtjc8ct.txt txt: ./txt/cord-017932-vmtjc8ct.txt summary: The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. 
The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. abstract: The word genomics was first coined by T. Roderick from the Jackson Laboratories in 1986 as the name for the new field of science focused on the analysis and comparison of complete genome sequences of organisms and related high-throughput technologies. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122628/ doi: 10.1007/978-1-60327-297-1_25 id: cord-325043-vqjhiv7p author: Gorbalenya, Alexander E. title: An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication date: 1989 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: NTP-motif, a consensus sequence previously shown to be characteristic of numerous NTP-utilizing enzymes, was identified in nonstructural proteins of several groups of positive-strand RNA viruses. These groups include picorna-, alpha-, and coronaviruses infecting animals and como-, poty-, tobamo-, tricorna-, hordei-, and furoviruses of plants, totalling 21 viruses. It has been demonstrated that the viral NTP-motif-containing proteins constitute three distinct families, the sequences within each family being similar to each other at a statistically highly significant level. A lower, but still valid similarity has also been revealed between the families. An overall alignment has been generated, which includes several highly conserved sequence stretches. The two most prominent of the latter contain the so-called “A” and “B” sites of the NTP-motif, with four of the five invariant amino acid residues observed within these sequences. 
These observations, taken together with the results of comparative analysis of the positions occupied by respective proteins (domains) in viral multidomain proteins, suggest that all the NTP-motif-containing proteins of positive-strand RNA viruses are homologous, constituting a highly diverged monophyletic group. In this group the “A” and “B” sites of the NTP-motif are the most conserved sequences and, by inference, should play the principal role in the functioning of the proteins. A hypothesis is proposed that all these proteins possess NTP-binding capacity and possibly NTPase activity, performing some NTP-dependent function in viral RNA replication. The importance of phylogenetic analysis for the assessment of the significance of the occurrence of the NTP-motif (and of sequence motifs of this sort in general) in proteins is emphasized. url: https://www.ncbi.nlm.nih.gov/pubmed/2522556/ doi: 10.1007/bf02102483 id: cord-328259-3g4klpyg author: Guajardo-Leiva, Sergio title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City date: 2020-09-21 words: 7626.0 sentences: 370.0 pages: flesch: 47.0 cache: ./cache/cord-328259-3g4klpyg.txt txt: ./txt/cord-328259-3g4klpyg.txt summary: Despite the overrepresentation of dsRNA viruses, our results show that Santiago's sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B) and Totiviridae family were also highly abundant in treated and untreated sewage samples from the EU [5, 7]. 
Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments. abstract: Sewage-associated viruses can cause several human and animal diseases, such as gastroenteritis, hepatitis, and respiratory infections. Therefore, their detection in wastewater can reflect current infections within the source population. To date, no viral study has been performed using the sewage of any large South American city. In this study, we used viral metagenomics to obtain a single sample snapshot of the RNA virosphere in the wastewater from Santiago de Chile, the seventh largest city in the Americas. Despite the overrepresentation of dsRNA viruses, our results show that Santiago’s sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Interestingly, we discovered three novel genogroups within the Picobirnaviridae family that can fill major gaps in this taxa’s evolutionary history. We also demonstrated the dominance of emerging Rotavirus genotypes, such as G8 and G6, that have displaced other classical genotypes, which is consistent with recent clinical reports. This study supports the usefulness of sewage viral metagenomics for public health surveillance. Moreover, it demonstrates the need to monitor the viral component during the wastewater treatment and recycling process, where this virome can constitute a reservoir of human pathogens. 
url: https://doi.org/10.3390/v12091050 doi: 10.3390/v12091050 id: cord-354465-5nqrrnqr author: Haslinger, Christian title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties date: 1999 words: 10341.0 sentences: 756.0 pages: flesch: 67.0 cache: ./cache/cord-354465-5nqrrnqr.txt txt: ./txt/cord-354465-5nqrrnqr.txt summary: Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992). abstract: The secondary structures of nucleic acids form a particularly important class of contact structures. Many important RNA molecules, however, contain pseudo-knots, a structural feature that is excluded explicitly from the conventional definition of secondary structures. We propose here a generalization of secondary structures incorporating ‘non-nested’ pseudo-knots, which we call bi-secondary structures, and discuss measures for the complexity of more general contact structures based on their graph-theoretical properties. 
Bi-secondary structures are planar trivalent graphs that are characterized by special embedding properties. We derive exact upper bounds on their number (as a function of the chain length n) implying that there are fewer different structures than sequences. Computational results show that the number of bi-secondary structures grows approximately like 2.35^n. Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. We find a large fraction of neutral mutations and, in particular, networks of sequences that fold into the same shape. These neutral networks percolate through the entire sequence space. url: https://www.ncbi.nlm.nih.gov/pubmed/17883226/ doi: 10.1006/bulm.1998.0085 id: cord-348427-worgd0xu author: Hatcher, Eneida L. title: Virus Variation Resource – improved response to emergent viral outbreaks date: 2017-01-04 words: 5552.0 sentences: 258.0 pages: flesch: 48.0 cache: ./cache/cord-348427-worgd0xu.txt txt: ./txt/cord-348427-worgd0xu.txt summary: The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus (Table 2). New processes have been added to parse source descriptor terms from GenBank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. 
To resolve this issue, the Virus Variation database loading pipeline parses GenBank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach. abstract: The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource. url: https://doi.org/10.1093/nar/gkw1065 doi: 10.1093/nar/gkw1065 id: cord-263987-ff6kor0c author: Holmes, Ian H. title: Solving the master equation for Indels date: 2017-05-12 words: 7131.0 sentences: 357.0 pages: flesch: 44.0 cache: ./cache/cord-263987-ff6kor0c.txt txt: ./txt/cord-263987-ff6kor0c.txt summary: BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. 
MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances. abstract: BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances. 
url: https://www.ncbi.nlm.nih.gov/pubmed/28494756/ doi: 10.1186/s12859-017-1665-1 id: cord-330067-ujhgb3b0 author: Huang, Yi title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes date: 2007-10-02 words: 3007.0 sentences: 168.0 pages: flesch: 55.0 cache: ./cache/cord-330067-ujhgb3b0.txt txt: ./txt/cord-330067-ujhgb3b0.txt summary: To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as ''Corona_NS3b'' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis. abstract: The recent SARS epidemic has boosted interest in the discovery of novel human and animal coronaviruses. By July 2007, more than 3000 coronavirus sequence records, including 264 complete genomes, are available in GenBank. The number of coronavirus species with complete genomes available has increased from 9 in 2003 to 25 in 2007, of which six, including coronavirus HKU1, bat SARS coronavirus, group 1 bat coronavirus HKU2, groups 2c and 2d coronaviruses, were sequenced by our laboratory. 
To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. Sequences can be directly downloaded from the website in FASTA format. CoVDB also provides detailed annotation of all coronavirus sequences using a standardized nomenclature system, and overcomes the problems of duplicated and identical sequences in other databases. For complete genomes, a single representative sequence for each species is available for comparative analysis such as phylogenetic studies. With the annotated sequences in CoVDB, more specific blast search results can be generated for efficient downstream analysis. url: https://www.ncbi.nlm.nih.gov/pubmed/17913743/ doi: 10.1093/nar/gkm754 id: cord-325985-xfzhn1n1 author: Jabado, Omar J. title: Comprehensive viral oligonucleotide probe design using conserved protein regions date: 2007-12-13 words: 4260.0 sentences: 227.0 pages: flesch: 47.0 cache: ./cache/cord-325985-xfzhn1n1.txt txt: ./txt/cord-325985-xfzhn1n1.txt summary: The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. 
We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions. abstract: Oligonucleotide microarrays have been applied to microbial surveillance and discovery where highly multiplexed assays are required to address a wide range of genetic targets. Although printing density continues to increase, the design of comprehensive microbial probe sets remains a daunting challenge, particularly in virology where rapid sequence evolution and database expansion confound static solutions. Here, we present a strategy for probe design based on protein sequences that is responsive to the unique problems posed in virus detection and discovery. The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. In silico testing using an experimentally derived thermodynamic model indicated near complete coverage of the viral sequence database. url: https://www.ncbi.nlm.nih.gov/pubmed/18079152/ doi: 10.1093/nar/gkm1106 id: cord-017354-cndb031c author: Janies, D. title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases date: 2008 words: 12429.0 sentences: 648.0 pages: flesch: 45.0 cache: ./cache/cord-017354-cndb031c.txt txt: ./txt/cord-017354-cndb031c.txt summary: The products of a phylogenetic analysis are a graphical tree of ancestor-descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor or descendent. 
Phylogenetic analysis of large genomic datasets can present several nested NP-complete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92]. Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts. abstract: Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe, and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results on two important infectious diseases, Severe Acute Respiratory Syndrome (SARS) and influenza. SARS and influenza have similarities and important differences both as biological and comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) are RNA based. Current evidence indicates that influenza viruses originate in aquatic birds from wild populations. 
Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of (1) various transmission pathways such as the role of intermediate hosts like swine and domestic birds and (2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci has become an important data type. In the past year, full genomic data has become prominent. These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002 and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks [JON03]. Thus comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. 
Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor–descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. However, this synthesis comes at a price. The cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristic tree search strategies and parallel computing. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121896/ doi: 10.1007/978-3-540-74331-6_2 id: cord-017584-9rx4jlw8 author: Kim, Kwangsoo title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data date: 2007 words: 3665.0 sentences: 216.0 pages: flesch: 57.0 cache: ./cache/cord-017584-9rx4jlw8.txt txt: ./txt/cord-017584-9rx4jlw8.txt summary: Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. 
As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 using viral genomic sequences from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites. abstract: Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Los Alamos National Laboratory and the National Center for Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. These results illustrate well the utility of the proposed method in genotyping applications. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122177/ doi: 10.1007/978-3-540-72665-4_8 id: cord-324021-y1vr1db0 author: Kozak, M. title: Determinants of translational fidelity and efficiency in vertebrate mRNAs date: 1994-12-31 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: This article reviews current knowledge on the mechanisms affecting the fidelity of initiation codon selection, and discusses the effects of structural features in the 5′-non-coding region on the efficiency of translation of messenger RNA molecules. 
url: https://www.sciencedirect.com/science/article/pii/0300908494901821 doi: 10.1016/0300-9084(94)90182-1 id: cord-353290-1wi1dhv6 author: Kustin, Talia title: Biased mutation and selection in RNA viruses date: 2020-09-28 words: 7611.0 sentences: 402.0 pages: flesch: 52.0 cache: ./cache/cord-353290-1wi1dhv6.txt txt: ./txt/cord-353290-1wi1dhv6.txt summary: We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses. abstract: RNA viruses are responsible for some of the worst pandemics known to mankind, including outbreaks of Influenza, Ebola, and the recent COVID-19. One major challenge in tackling RNA viruses is the fact they are extremely genetically diverse. Nevertheless, they share common features that include their dependence on host cells for replication, and high mutation rates. We set out to search for shared evolutionary characteristics that may aid in gaining a broader understanding of RNA virus evolution, and constructed a phylogeny-based dataset spanning thousands of sequences from diverse single-stranded RNA viruses of animals. 
Strikingly, we found that the vast majority of these viruses have a skewed nucleotide composition, manifested as adenine rich (A-rich) coding sequences. In order to test whether A-richness is driven by selection or by biased mutation processes, we harnessed the effects of incomplete purifying selection at the tips of virus phylogenies. Our results revealed consistent mutational biases towards U rather than A in genomes of all viruses. In +ssRNA viruses we found that this bias is compensated by selection against U and selection for A, which leads to A-rich genomes. In -ssRNA viruses the genomic mutational bias towards U on the negative strand manifests as A-rich coding sequences, on the positive strand. We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. url: https://doi.org/10.1093/molbev/msaa247 doi: 10.1093/molbev/msaa247 id: cord-001340-kqcx7lrq author: Ladner, Jason T. title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date: 2014-06-17 words: 2512.0 sentences: 121.0 pages: flesch: 40.0 cache: ./cache/cord-001340-kqcx7lrq.txt txt: ./txt/cord-001340-kqcx7lrq.txt summary: Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/ DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. 
Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. abstract: Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques. 
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4068259/ doi: 10.1128/mbio.01360-14 id: cord-321150-ev6acl7b author: Lam, Ha Minh title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm date: 2017-10-03 words: 3184.0 sentences: 161.0 pages: flesch: 50.0 cache: ./cache/cord-321150-ev6acl7b.txt txt: ./txt/cord-321150-ev6acl7b.txt summary: Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values. abstract: Identifying recombinant sequences in an era of large genomic databases is challenging as it requires an efficient algorithm to identify candidate recombinants and parents, as well as appropriate statistical methods to correct for the large number of comparisons performed. In 2007, a computation was introduced for an exact nonparametric mosaicism statistic that gave high-precision P values for putative recombinants. This exact computation meant that multiple-comparisons corrected P values also had high precision, which is crucial when performing millions or billions of tests in large databases. 
Here, we introduce an improvement to the algorithmic complexity of this computation from O(mn^3) to O(mn^2), where m and n are the numbers of recombination-informative sites in the candidate recombinant. This new computation allows for recombination analysis to be performed in alignments with thousands of polymorphic sites. Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. url: https://doi.org/10.1093/molbev/msx263 doi: 10.1093/molbev/msx263 id: cord-025610-7vouj8pp author: Latif, Seemab title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints date: 2020-05-06 words: 3923.0 sentences: 230.0 pages: flesch: 50.0 cache: ./cache/cord-025610-7vouj8pp.txt txt: ./txt/cord-025610-7vouj8pp.txt summary: In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13]. Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method to address the issue of lexical constrained sequence generation. 
abstract: Advancements in Long Short Term Memory (LSTM) Networks have shown remarkable success in various Natural Language Generation (NLG) tasks. However, generating sequences from pre-specified lexical constraints is a new, challenging and less researched area in NLG. Lexical constraints take the form of words in the language model’s output to create fluent and meaningful sequences. Furthermore, most of the previous approaches address this problem by allowing the inclusion of pre-specified lexical constraints during the decoding process, which increases the decoding complexity exponentially or linearly with the number of constraints. Moreover, most of the previous approaches can only deal with a single constraint. In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater to multiple lexical constraints for generating quality sequences. Experiments show that our proposed architecture outperforms previous methods in terms of intrinsic evaluation. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7256622/ doi: 10.1007/978-3-030-49186-4_4 id: cord-331698-rwow1ydx author: Latorre-Pérez, Adriel title: A lab in the field: applications of real-time, in situ metagenomic sequencing date: 2020-08-20 words: 6732.0 sentences: 335.0 pages: flesch: 36.0 cache: ./cache/cord-331698-rwow1ydx.txt txt: ./txt/cord-331698-rwow1ydx.txt summary: This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. 
Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25] , microbial communities inspection in the Arctic [26] , DNA sequencing on the International Space Station (ISS) [27] , and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29] . In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced. abstract: High-throughput metagenomic sequencing is considered one of the main technologies fostering the development of microbial ecology. Widely used second-generation sequencers have enabled the analysis of extremely diverse microbial communities, the discovery of novel gene functions, and the comprehension of the metabolic interconnections established among microbial consortia. However, the high cost of the sequencers and the complexity of library preparation and sequencing protocols still hamper the application of metagenomic sequencing in a vast range of real-life applications. In this context, the emergence of portable, third-generation sequencers is becoming a popular alternative for the rapid analysis of microbial communities in particular scenarios, due to their low cost, simplicity of operation, and rapid yield of results. 
This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. url: https://doi.org/10.1093/biomethods/bpaa016 doi: 10.1093/biomethods/bpaa016 id: cord-252347-vnn4135b author: Lee, Wai-Ming title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants date: 2007-10-03 words: 5672.0 sentences: 271.0 pages: flesch: 51.0 cache: ./cache/cord-252347-vnn4135b.txt txt: ./txt/cord-252347-vnn4135b.txt summary: METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amount of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5'NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences. abstract: BACKGROUND: Human rhinoviruses (HRVs) are the most prevalent human pathogens, and consist of 101 serotypes that are classified into groups A and B according to sequence variations. 
HRV infections cause a wide spectrum of clinical outcomes ranging from asymptomatic infection to severe lower respiratory symptoms. Defining the role of specific strains in various HRV illnesses has been difficult because traditional serology, which requires viral culture and neutralization tests using 101 serotype-specific antisera, is insensitive and laborious. METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. Nasal samples from 26 infants were first tested with a multiplex PCR assay for respiratory viruses, and HRV was the most common virus found (108 of 181 samples). Typing was completed for 101 samples and 103 HRVs were identified. Surprisingly, 54 (52.4%) HRVs did not match any of the known serotypes and had 12–35% nucleotide divergence from the nearest reference HRVs. Of these novel viruses, 9 strains (17 HRVs) segregated from HRVA, HRVB and human enterovirus into a distinct genetic group (“C”). None of these new strains could be cultured in traditional cell lines. CONCLUSIONS: By molecular analysis, over 50% of HRV detected in sick infants were previously unrecognized strains, including 9 strains that may represent a new HRV group. These findings indicate that the number of HRV strains is considerably larger than the 101 serotypes identified with traditional diagnostic techniques, and provide evidence of a new HRV group. url: https://www.ncbi.nlm.nih.gov/pubmed/17912345/ doi: 10.1371/journal.pone.0000966 id: cord-338207-60vrlrim author: Lefkowitz, E.J. title: Virus Databases date: 2008-07-30 words: 7957.0 sentences: 368.0 pages: flesch: 48.0 cache: ./cache/cord-338207-60vrlrim.txt txt: ./txt/cord-338207-60vrlrim.txt summary: (Each arrow points to the table containing the primary key.) 
Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes. abstract: As tools and technologies for the analysis of biological organisms (including viruses) have improved, the amount of raw data generated by these technologies has increased exponentially. Today's challenge, therefore, is to provide computational systems that support data storage, retrieval, display, and analysis in a manner that allows the average researcher to mine this information for knowledge pertinent to his or her work. Every article in this encyclopedia contains knowledge that has been derived in part from the analysis of such large data sets, which in turn are directly dependent on the databases that are used to organize this information. Fortunately, continual improvements in data-intensive biological technologies have been matched by the development of computational technologies, including those related to databases. This work forms the basis of many of the technologies that encompass the field of bioinformatics. This article provides an overview of database structure and how that structure supports the storage of biological information. 
The different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological, as well as virus-specific, information. url: https://api.elsevier.com/content/article/pii/B9780123744104007196 doi: 10.1016/b978-012374410-4.00719-6 id: cord-342785-55r01n0x author: Lemmon, Gordon H title: Predicting the sensitivity and specificity of published real-time PCR assays date: 2008-09-25 words: 4317.0 sentences: 239.0 pages: flesch: 52.0 cache: ./cache/cord-342785-55r01n0x.txt txt: ./txt/cord-342785-55r01n0x.txt summary: METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens abstract: BACKGROUND: In recent years real-time PCR has become a leading technique for nucleic acid detection and quantification. These assays have the potential to greatly enhance efficiency in the clinical laboratory. 
Choice of primer and probe sequences is critical for accurate diagnosis in the clinic, yet current primer/probe signature design strategies are limited, and signature evaluation methods are lacking. METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. We found real-time PCR signatures described in recent literature and used a BLAST search based approach to collect all hits to the primer-probe combinations that should be amplified by real-time PCR chemistry. We then compared our hits with the sequences in the NCBI taxonomy tree that the signature was designed to detect. RESULTS: We found that many published signatures have high specificity (almost no false positives) but low sensitivity (high false negative rate). Where high sensitivity is needed, we offer a revised methodology for signature design which may designate that multiple signatures are required to detect all sequenced strains. We use this methodology to produce new signatures that are predicted to have higher sensitivity and specificity. CONCLUSION: We show that current methods for real-time PCR assay design have unacceptably low sensitivities for most clinical applications. Additionally, as new sequence data becomes available, old assays must be reassessed and redesigned. A standard protocol for both generating and assessing the quality of these assays is therefore of great value. Real-time PCR has the capacity to greatly improve clinical diagnostics. The improved assay design and evaluation methods presented herein will expedite adoption of this technique in the clinical lab. 
url: https://www.ncbi.nlm.nih.gov/pubmed/18817537/ doi: 10.1186/1476-0711-7-18 id: cord-321386-u1imic5l author: Li, Chun title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation date: 2018-02-17 words: 5503.0 sentences: 311.0 pages: flesch: 59.0 cache: ./cache/cord-321386-u1imic5l.txt txt: ./txt/cord-321386-u1imic5l.txt summary: METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop a SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a dimensional feature vector can be constructed to numerically characterize a protein sequence: , By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou''s pseudo amino acid composition abstract: AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. 
METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experiment results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins. url: https://doi.org/10.2174/1386207321666180130100838 doi: 10.2174/1386207321666180130100838 id: cord-306725-0vam15pt author: Li, Hao title: First detection and genomic characteristics of bovine torovirus in dairy calves in China date: 2020-05-09 words: 3015.0 sentences: 156.0 pages: flesch: 58.0 cache: ./cache/cord-306725-0vam15pt.txt txt: ./txt/cord-306725-0vam15pt.txt summary: Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. 
A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2), designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV/SC-2/China investigated in this study are indicated by black triangles. Fig. 2: Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins. abstract: Bovine torovirus (BToV) is a diarrhea-causing pathogen. In this study, 92 diarrheic fecal samples from five farms in four provinces in China were collected and tested for BToV using an RT-PCR assay, and 21.73% of samples were found to be BToV positive. Moreover, two complete BToV genome sequences (MN073058 and MN073059) were obtained from the clinical samples, which were 28,297 and 28,301 nucleotides in length, respectively. Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. In addition, seven consecutive amino acid mutations were found from aa 1,486 to 1,492 in the S protein of isolate MN073058. Moreover, the two isolates shared one identical amino acid mutation in the receptor binding sites of the HE protein. To the best of our knowledge, this is the first report on the epidemic and genomic characterization of BToV in China, which is helpful for further understanding the genetic evolution of BToV.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s00705-020-04657-9) contains supplementary material, which is available to authorized users. url: https://doi.org/10.1007/s00705-020-04657-9 doi: 10.1007/s00705-020-04657-9 id: cord-341879-vubszdp2 author: Li, Lucy M title: Genomic analysis of emerging pathogens: methods, application and future trends date: 2014-11-22 words: 5029.0 sentences: 253.0 pages: flesch: 36.0 cache: ./cache/cord-341879-vubszdp2.txt txt: ./txt/cord-341879-vubszdp2.txt summary: In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences. abstract: The number of emerging infectious diseases is increasing. Characterizing novel or re-emerging infections is aided by the availability of pathogen genomes. In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. 
url: https://www.ncbi.nlm.nih.gov/pubmed/25418281/ doi: 10.1186/s13059-014-0541-9 id: cord-345552-h6fwi0qn author: Li, Q.-G. title: Hydropathic characteristics of adenovirus hexons date: 1997-07-01 words: 3522.0 sentences: 206.0 pages: flesch: 53.0 cache: ./cache/cord-345552-h6fwi0qn.txt txt: ./txt/cord-345552-h6fwi0qn.txt summary: The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2). DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein abstract: The complete nucleotide sequence and the predicted amino acid sequence of the adenovirus type 7 hexon gene were determined. The hydropathy of the hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 was analysed. The presence of purines and pyrimidines in the second position of the codons was correlated to hydrophilicity and hydrophobicity, respectively. Comparison of the hydrophilicity plots of eight hexons showed seven hypervariable regions to be distributed on the surface. A large portion of the hypervariable regions manifests hydrophilicity.
The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. Analysis of codon usage for adenovirus hexons showed that, among synonymous codons, those with cytidine in the third position were strongly preferred. Analysis of the nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed members of subgenera B, D and E to be closely related, especially Ad4 and Ad16, and subgenus A to be closely related to subgenus F. url: https://www.ncbi.nlm.nih.gov/pubmed/9267445/ doi: 10.1007/s007050050162 id: cord-001537-i34vmfpp author: Lima, Francisco Esmaile de Sales title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil date: 2015-02-17 words: 3874.0 sentences: 195.0 pages: flesch: 53.0 cache: ./cache/cord-001537-i34vmfpp.txt txt: ./txt/cord-001537-i34vmfpp.txt summary: The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2). abstract: Circoviruses are highly prevalent porcine and avian pathogens.
In recent years, novel circular ssDNA genomes have been detected in a variety of fecal and environmental samples using deep sequencing approaches. In this study the identification of genomes of novel circoviruses and cycloviruses in feces of insectivorous bats is reported. Pan-reactive primers were used targeting the conserved rep region of circoviruses and cycloviruses to screen DNA bat fecal samples. Using this approach, partial rep sequences were detected which formed five phylogenetic groups distributed among the Circovirus and the recently proposed Cyclovirus genera of the Circoviridae. Further analysis using inverse PCR and Sanger sequencing led to the characterization of four new putative members of the family Circoviridae with genome size ranging from 1,608 to 1,790 nt, two inversely arranged ORFs, and canonical nonamer sequences atop a stem loop. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331541/ doi: 10.1371/journal.pone.0118070 id: cord-330312-1pjolkql author: Liu, Y.-T. title: Infectious Disease Genomics date: 2017-01-20 words: 5168.0 sentences: 327.0 pages: flesch: 45.0 cache: ./cache/cord-330312-1pjolkql.txt txt: ./txt/cord-330312-1pjolkql.txt summary: One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30–32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project.
48 The completed or ongoing genome projects (Table 10 .1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum abstract: The history and development of infectious disease genomics have been closely associated with the Human Genome Project (HGP) during the past 20 years. It has been emphasized since the beginning of the HGP that such effort must not be restricted to the human genome and should include other organisms including mouse, bacteria, yeast, fruit fly, and worm for comparative sequence analyses. A brief history is reviewed in this chapter. As of 2016, more than 7000 completed genome sequencing projects have been reported. One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. A number of examples are discussed in this chapter. url: https://www.sciencedirect.com/science/article/pii/B978012799942500010X doi: 10.1016/b978-0-12-799942-5.00010-x id: cord-265857-fs6dj3dp author: Liu, Yu-Tsueng title: Infectious Disease Genomics date: 2010-12-24 words: 4341.0 sentences: 233.0 pages: flesch: 45.0 cache: ./cache/cord-265857-fs6dj3dp.txt txt: ./txt/cord-265857-fs6dj3dp.txt summary: The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. 
The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002) . Genome sequencing projects for other important human disease vectors are in progress Megy et al., 2009 ). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10 .1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. abstract: The history and development of infectious disease genomics are discussed in this chapter. HGP must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The polysaccharide capsule is important for meningococci to escape from complement-mediated killing. With the completion of the genome sequence of a virulent MenB strain, a “reverse vaccinology” approach was applied for the development of a universal MenB vaccine by Novartis. The indispensable fatty acid synthase (FAS) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. Through a systematic screening of 250,000 natural product extracts, a Merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from Streptomyces platensis. 
Vector Biology Network was formed to achieve three goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year 2000; (2) to engineer a mosquito incapable of carrying the malaria parasite by 2005; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by 2010. The most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. url: https://www.sciencedirect.com/science/article/pii/B9780123848901000108 doi: 10.1016/b978-0-12-384890-1.00010-8 id: cord-287658-c2lljdi7 author: Lopez-Rincon, Alejandro title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning date: 2020-09-10 words: 4766.0 sentences: 253.0 pages: flesch: 55.0 cache: ./cache/cord-287658-c2lljdi7.txt txt: ./txt/cord-287658-c2lljdi7.txt summary: The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e.g., Real-Time PCR amplicons to identify the SARS-CoV-2 [16]. The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in [26] the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID (Table 2) repository.
abstract: In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from available repositories, separating the genome of different virus strains from the Coronavirus family with considerable accuracy. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets on existing datasets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from NGDC, separating the genome of different virus strains from the Coronavirus family with accuracy 98.73%. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results.
Finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics. url: https://doi.org/10.1101/2020.03.13.990242 doi: 10.1101/2020.03.13.990242 id: cord-302161-ytr7ds8i author: Lutz, Mirjam title: FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP date: 2020-07-24 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Feline Infectious Peritonitis (FIP), the deadliest infectious disease of young cats in shelters or catteries, is induced by highly virulent feline coronaviruses (FCoVs) emerging in infected hosts after mutations of less virulent FCoVs. Previous studies have shown that some mutations in the open reading frames (ORF) 3c and 7b and the spike (S) gene have implications for the development of FIP, but mainly indirectly, likely also due to their association with systemic spread. The aim of the present study was to determine whether FCoVs detected in organs of experimentally FCoV-infected healthy cats carry some of these mutations. Viral RNA isolated from different tissues of seven asymptomatic cats infected with the field strains FCoV Zu1 or FCoV Zu3 was sequenced. Deletions in the 3c gene and mutations in the 7b and S genes that have been shown to have implications for the development of FIP were not detected, suggesting that these are not essential for systemic viral dissemination.
However, deletions and single nucleotide polymorphisms leading to truncations were detected in all nonstructural proteins. These were found across all analyzed ORFs, but with significantly higher frequency in ORF 7b than ORF 3a. Additionally, a previously unknown homologous recombination site was detected in FCoV Zu1. url: https://doi.org/10.3390/pathogens9080603 doi: 10.3390/pathogens9080603 id: cord-025948-6dsx7pey author: Maitra, Arindam title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility date: 2020-06-04 words: 7218.0 sentences: 382.0 pages: flesch: 56.0 cache: ./cache/cord-025948-6dsx7pey.txt txt: ./txt/cord-025948-6dsx7pey.txt summary: Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRP and nucleocapsid coding genes. abstract: Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. Seven of the isolates belonged to the A2a clade, while one belonged to the B4 clade. 
Specific mutations, characteristic of the A2a clade, were also detected, which included the P323L in RNA-dependent RNA polymerase and D614G in the Spike glycoprotein. Further, our data revealed emergence of novel subclones harbouring nonsynonymous mutations, viz. G1124V in Spike (S) protein, R203K, and G204R in the nucleocapsid (N) protein. The N protein mutations reside in the SR-rich region involved in viral capsid formation and the S protein mutation is in the S(2) domain, which is involved in triggering viral fusion with the host cell membrane. Interesting correlation was observed between these mutations and travel or contact history of COVID-19 positive cases. Consequent alterations of miRNA binding and structure were also predicted for these mutations. More importantly, the possible implications of mutation D614G (in S(D) domain) and G1124V (in S(2) subunit) on the structural stability of S protein have also been discussed. Results report for the first time a bird’s eye view on the accumulation of mutations in SARS-CoV-2 genome in Eastern India. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s12038-020-00046-1) contains supplementary material, which is available to authorized users. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7269891/ doi: 10.1007/s12038-020-00046-1 id: cord-010161-bcuec2fz author: Matson, David O. title: IV, 6. Calicivirus RNA recombination date: 2004-09-14 words: 3335.0 sentences: 168.0 pages: flesch: 45.0 cache: ./cache/cord-010161-bcuec2fz.txt txt: ./txt/cord-010161-bcuec2fz.txt summary: With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. 
Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2) . While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5'' end of that strain''s capsid gene (ID="B" sequence for this Fig.) . The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades. abstract: RNA recombination apparently contributed to the evolution of CVs. Nucleic acid sequence homology or identity and similar RNA secondary structure of CVs and non-CVs may provide a locus for recombination within CVs or with non-CVs should co-infections of the same cell occur. Natural recombinants have been demonstrated among other enteric viruses, including Picornaviridae (Kirkegaard and Baltimore, 1986; Furione et al., 1993), Astroviridae (Walter et al., 2001), and possibly rotaviruses (e.g., Desselberger, 1996; Suzuki et al., 1998), augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. Such is the case for CVs and Astroviridae, whose recombinant strains may be a common portion of naturally circulating strains. 
The taxonomic — and perhaps biologic — limits of recombination are defined by the suggested recombination of Nanovirus and CV, viruses from hosts of different biologic orders; the relationship of picornaviruses and CVs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of NLVs. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7172178/ doi: 10.1016/s0168-7069(03)09032-3 id: cord-275258-azpg5yrh author: Mead, Dylan J.T. title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date: 2019-07-26 words: 6333.0 sentences: 346.0 pages: flesch: 53.0 cache: ./cache/cord-275258-azpg5yrh.txt txt: ./txt/cord-275258-azpg5yrh.txt summary: title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level. abstract: The protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. 
Comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. Homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. Reconstructed ancestral RdRP sequences for individual genera were also used as templates for the production of ancestral RdRP homology models. High quality ancestral RdRP models were consistently produced, as were good quality models for target-template pairs in the same genus. Homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. We present a protocol for the production of optimal RdRP homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds. url: https://www.sciencedirect.com/science/article/pii/S109332631930333X doi: 10.1016/j.jmgm.2019.07.014 id: cord-027316-echxuw74 author: Modarresi, Kourosh title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date: 2020-05-22 words: 2116.0 sentences: 148.0 pages: flesch: 49.0 cache: ./cache/cord-027316-echxuw74.txt txt: ./txt/cord-027316-echxuw74.txt summary: This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text.
The model uses an encoder-decoder architecture based on an attention-based decoder, with regularization applied to the corresponding weights. Deep learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94]. Although modified versions of the RNN (recurrent neural network), such as LSTM and GRU, improve on the RNN in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is α (Embedding Error)^2 (Eq. 6). Input to any model has to be a number, and hence the raw input of words or text sequence needs to be transformed to continuous numbers. abstract: Every individual text or document is generated for specific purpose(s). Sometimes, the text is deployed to convey a specific message about an event or a product. On other occasions, it may be communicating a scientific breakthrough, development or new model and so on. Given any specific objective, the creators and the users of documents may like to know which part(s) of the documents are more influential in conveying their specific messages or achieving their objectives. Understanding which parts of a document have more impact on the viewer’s perception would allow the content creators to design more effective content. Detecting the more impactful parts of a piece of content would help content users, such as advertisers, to concentrate their efforts on those parts of the content and thus to avoid spending resources on the rest of the document. This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on an attention-based decoder, with regularization applied to the corresponding weights.
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7304011/ doi: 10.1007/978-3-030-50420-5_20 id: cord-325750-x7jpsnxg author: Mokili, John L title: Metagenomics and future perspectives in virus discovery date: 2012-01-20 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Monitoring the emergence and re-emergence of viral diseases with the goal of containing the spread of viral agents requires both adequate preparedness and quick response. Identifying the causative agent of a new epidemic is one of the most important steps for effective response to disease outbreaks. Traditionally, virus discovery required propagation of the virus in cell culture, a proven technique responsible for the identification of the vast majority of viruses known to date. However, many viruses cannot be easily propagated in cell culture, thus limiting our knowledge of viruses. Viral metagenomic analyses of environmental samples suggest that the field of virology has explored less than 1% of the extant viral diversity. In the last decade, the culture-independent and sequence-independent metagenomic approach has permitted the discovery of many viruses in a wide range of samples. Phylogenetically, some of these viruses are distantly related to previously discovered viruses. In addition, 60–99% of the sequences generated in different viral metagenomic studies are not homologous to known viruses. In this review, we discuss the advances in the area of viral metagenomics during the last decade and their relevance to virus discovery, clinical microbiology and public health. We discuss the potential of metagenomics for characterization of the normal viral population in a healthy community and identification of viruses that could pose a threat to humans through zoonosis. In addition, we propose a new model of the Koch's postulates named the ‘Metagenomic Koch's Postulates’. 
Unlike the original Koch's postulates and the Molecular Koch's postulates as formulated by Falkow, the metagenomic Koch's postulates focus on the identification of metagenomic traits in disease cases. These are the metagenomic traits that can be traced after healthy individuals have been exposed to the source of the suspected pathogen. url: https://doi.org/10.1016/j.coviro.2011.12.004 doi: 10.1016/j.coviro.2011.12.004 id: cord-000642-mkwpuav6 author: Moreira, Rebeca title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing date: 2012-04-19 words: 6848.0 sentences: 372.0 pages: flesch: 45.0 cache: ./cache/cord-000642-mkwpuav6.txt txt: ./txt/cord-000642-mkwpuav6.txt summary: title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes such as apoptosis, the Toll-like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18]. The R. philippinarum transcriptome and sequences from another four bivalve species were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae). abstract: BACKGROUND: The Manila clam (Ruditapes philippinarum) is a bivalve species cultured worldwide with important commercial value.
Diseases affecting this species can result in large economic losses. Because knowledge of the molecular mechanisms of the immune response in bivalves, especially clams, is scarce and fragmentary, we sequenced RNA from immune-stimulated R. philippinarum hemocytes by 454-pyrosequencing to identify genes involved in their immune defense against infectious diseases. METHODOLOGY AND PRINCIPAL FINDINGS: High-throughput deep sequencing of R. philippinarum using 454 pyrosequencing technology yielded 974,976 high-quality reads with an average read length of 250 bp. The reads were assembled into 51,265 contigs, and 44.7% of the nucleotide sequences translated into protein were annotated successfully. The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes such as apoptosis, the Toll-like signaling pathway and the complement cascade. We have found sequences from molecules never described in bivalves before, especially in the complement pathway, where almost all the components are present. CONCLUSIONS: This study represents the first transcriptome analysis using 454-pyrosequencing conducted on R. philippinarum focused on its immune system. Our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and the study of gene expression as well as for the identification of genetic markers. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum.
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334963/ doi: 10.1371/journal.pone.0035009 id: cord-311240-o0zyt2vb author: Motayo, Babatunde Olarenwaju title: Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences date: 2020-07-27 words: 3091.0 sentences: 167.0 pages: flesch: 50.0 cache: ./cache/cord-311240-o0zyt2vb.txt txt: ./txt/cord-311240-o0zyt2vb.txt summary: Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating; we advocate for upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control efforts of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020). This study was designed to determine the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. In the recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against reference whole genome sequences of SARS, recombination signals were observed between the African SARSCoV-2 sequences and reference sequences (major recombinant hCoV-19 Pangolin/Guangu P4L/2017; minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2). abstract: The ongoing SARSCoV-2 pandemic was introduced into Africa on 14th February 2020 and has rapidly spread across the continent, causing a severe public health crisis and mortality. We investigated the genetic diversity and evolution of this virus during the early outbreak months using whole genome sequences. We performed recombination analysis against closely related CoVs and Bayesian time-scaled phylogeny, and investigated spike protein amino acid mutations. Results from our analysis showed recombination signals between the AfrSARSCoV-2 sequences and reference sequences within the N and S genes.
The evolutionary rate of the AfrSARSCoV-2 was 4.133 × 10⁻⁴ substitutions/site/year (high posterior density, HPD: 4.132 × 10⁻⁴ to 4.134 × 10⁻⁴). The time to most recent common ancestor (TMRCA) of the African strains was December 7th 2019. The AfrSARSCoV-2 sequences diversified into two lineages, A and B, with B being more diverse, with multiple sub-lineages confirmed by both the maximum clade credibility (MCC) tree and PANGOLIN software. There was a high prevalence of the D614G spike protein amino acid mutation (82.61%) among the African strains. Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating; we advocate for upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control efforts of SARSCoV-2 in Africa. url: https://doi.org/10.1101/2020.07.27.222901 doi: 10.1101/2020.07.27.222901 id: cord-018459-isbc1r2o author: Munjal, Geetika title: Phylogenetics Algorithms and Applications date: 2018-12-10 words: 1851.0 sentences: 122.0 pages: flesch: 42.0 cache: ./cache/cord-018459-isbc1r2o.txt txt: ./txt/cord-018459-isbc1r2o.txt summary: This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In the literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence.
Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15]. abstract: Phylogenetics is a powerful approach for tracing the evolution of present-day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among species. The phylogenetic study can help in analysing the evolution and the similarities among diseases and viruses, and further help in prescribing vaccines against them. This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. The paper has also discussed the application of phylogenetic study in disease diagnosis and evolution. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123334/ doi: 10.1007/978-981-13-5934-7_17 id: cord-264746-gfn312aa author: Muse, Spencer title: GENOMICS AND BIOINFORMATICS date: 2012-03-29 words: 10976.0 sentences: 583.0 pages: flesch: 58.0 cache: ./cache/cord-264746-gfn312aa.txt txt: ./txt/cord-264746-gfn312aa.txt summary: The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists.
Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. abstract: This chapter discusses the basic principles of molecular biology regarding genome science and describes the major types of data involved in genome projects, including technologies for collecting them. Genome science is heavily driven by new technological advances that allow for rapid and inexpensive collection of various types of data. The emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. It has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. The next generation of biologists needs to be as comfortable at a computer workstation as they are at the lab bench. Recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science. The chapter discusses practical applications and uses of genomic data. For example, gene therapies that can repair genetic defects are in the foreseeable future.
url: https://api.elsevier.com/content/article/pii/B978012238662650015X doi: 10.1016/b978-0-12-238662-6.50015-x id: cord-321762-7kiahjyy author: Nandy, Ashesh title: Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences date: 2015-12-31 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: The very rapid growth in molecular sequence data from the daily accretion of large gene and protein sequencing projects has led to issues regarding viewing and analyzing the massive amounts of data. Graphical representation and numerical characterization of DNA, RNA and protein sequences have exhibited great potential to address these concerns. We review here in brief several different formulations of these representations and examples of applications to diverse problems based on what this author had presented at the Second Mathematical Chemistry Workshop of the Americas in Bogota, Colombia in 2010. In particular, we note several insights that were gained from such representations, and the applications to the bio-medicinal field. url: https://api.elsevier.com/content/article/pii/B9781681080536500053 doi: 10.1016/b978-1-68108-053-6.50005-3 id: cord-326225-crtpzad7 author: Neill, John D. title: Simultaneous rapid sequencing of multiple RNA virus genomes date: 2014-06-01 words: 3804.0 sentences: 204.0 pages: flesch: 55.0 cache: ./cache/cord-326225-crtpzad7.txt txt: ./txt/cord-326225-crtpzad7.txt summary: This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation of each viral library following pooling and sequencing. There is a wealth of information in these isolates, but until now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing.
These primers were developed so that the 20 base known sequence was used for PCR amplification of the library and also served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2, library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled. abstract: Comparing sequences of archived viruses collected over many years to the present allows the study of viral evolution and contributes to the design of new vaccines. However, the difficulty, time and expense of generating full-length sequences individually from each archived sample have hampered these studies. Next generation sequencing technologies have been utilized for analysis of clinical and environmental samples to identify viral pathogens that may be present. This has led to the discovery of many new, uncharacterized viruses from a number of viral families. Use of these sequencing technologies would be advantageous in examining viral evolution. In this study, a sequencing procedure was used to sequence multiple archived samples simultaneously and rapidly using a single standard protocol. This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation of each viral library following pooling and sequencing. This conferred sequence independence by random priming of both first and second strand cDNA synthesis. Viral stocks were treated with a nuclease cocktail to reduce the presence of host nucleic acids. Viral RNA was extracted, followed by single tube random-primed double-stranded cDNA synthesis.
The resultant cDNAs were amplified by primer-specific PCR, pooled, size fractionated and sequenced on the Ion Torrent PGM platform. The individual virus genomes were readily assembled by both de novo and template-assisted assembly methods. This procedure consistently resulted in near full length, if not full-length, genomic sequences and was used to sequence multiple bovine pestivirus and coronavirus isolates simultaneously. url: https://doi.org/10.1016/j.jviromet.2014.02.016 doi: 10.1016/j.jviromet.2014.02.016 id: cord-014461-2ubh9u8r author: Nelson, Oranmiyan W. title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date: 2012-10-10 words: 4124.0 sentences: 454.0 pages: flesch: 44.0 cache: ./cache/cord-014461-2ubh9u8r.txt txt: ./txt/cord-014461-2ubh9u8r.txt summary: Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. 
Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042 abstract: The purpose of this table is to provide the community with a citable record of publications of ongoing genome sequencing projects that have led to a publication in the scientific literature. While our goal is to make the list complete, there is no guarantee that we have not omitted one or more publications appearing in this time frame. Readers and authors who wish to have publications added to subsequent versions of this list are invited to provide the bibliographic data for such references to the SIGS editorial office.
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3570808/ doi: 10.4056/sigs.3416907 id: cord-016293-pyb00pt5 author: Newell-McGloughlin, Martina title: The flowering of the age of Biotechnology 1990–2000 date: 2006 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7120537/ doi: 10.1007/1-4020-5149-2_4 id: cord-255371-o9oxchq6 author: Nguyen, Thanh Thi title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) date: 2020-07-10 words: 5640.0 sentences: 365.0 pages: flesch: 59.0 cache: ./cache/cord-255371-o9oxchq6.txt txt: ./txt/cord-255371-o9oxchq6.txt summary: title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations that occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13]. By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics. abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic.
Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to make changes in protein secondary structure and relative solvent accessibility. Based on 6,324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 from many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are the most stable, potentially suitable to be targeted for vaccine and drug development. url: https://doi.org/10.1101/2020.07.10.171769 doi: 10.1101/2020.07.10.171769 id: cord-012975-u87ol3fs author: Ogiwara, Atsushi title: Construction of a dictionary of sequence motifs that characterize groups of related proteins date: 1992-09-17 words: 3112.0 sentences: 165.0 pages: flesch: 55.0 cache: ./cache/cord-012975-u87ol3fs.txt txt: ./txt/cord-012975-u87ol3fs.txt summary: An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989), are usually identified by the tedious method of multiply aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies.
Functional groups of proteins: Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. abstract: An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database and a dictionary of sequence motifs that relate to specific superfamilies is constructed. The motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences. The sequence motifs identified represent functionally important sites on protein molecules. When multiple blocks exist in a single motif they are often close together in the 3-D structure. Furthermore, occasionally these motif blocks were found to be split by introns when the correlation with exon structures was examined. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528547/ doi: 10.1093/protein/5.6.479 id: cord-355075-ieb35upi author: Papenfuss, Anthony T title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox date: 2012-06-20 words: 8952.0 sentences: 480.0 pages: flesch: 54.0 cache: ./cache/cord-355075-ieb35upi.txt txt: ./txt/cord-355075-ieb35upi.txt summary: The P. alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules.
To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response. abstract: BACKGROUND: Bats are the natural reservoir host for a range of emerging and re-emerging viruses, including SARS-like coronaviruses, Ebola viruses, henipaviruses and Rabies viruses. However, the mechanisms responsible for the control of viral replication in bats are not understood and there is little information available on any aspect of antiviral immunity in bats. Massively parallel sequencing of the bat transcriptome provides the opportunity for rapid gene discovery. Although the genomes of one megabat and one microbat have now been sequenced to low coverage, no transcriptomic datasets have been reported from any bat species. In this study, we describe the immune transcriptome of the Australian flying fox, Pteropus alecto, providing an important resource for identification of genes involved in a range of activities including antiviral immunity. RESULTS: Towards understanding the adaptations that have allowed bats to coexist with viruses, we have de novo assembled transcriptome sequence from immune tissues and stimulated cells from P. alecto. 
We identified about 18,600 genes involved in a broad range of activities with the most highly expressed genes involved in cell growth and maintenance, enzyme activity, cellular components and metabolism and energy pathways. 3.5% of the bat transcribed genes corresponded to immune genes and a total of about 500 immune genes were identified, providing an overview of both innate and adaptive immunity. A small proportion of transcripts found no match with annotated sequences in any of the public databases and may represent bat-specific transcripts. CONCLUSIONS: This study represents the first reported bat transcriptome dataset and provides a survey of expressed bat genes that complement existing bat genomic data. In addition, these data provide insight into genes relevant to the antiviral responses of bats, and form a basis for examining the roles of these molecules in immune response to viral infection. url: https://doi.org/10.1186/1471-2164-13-261 doi: 10.1186/1471-2164-13-261 id: cord-304607-td0776wj author: Paszkiewicz, Konrad H. title: Omics, Bioinformatics, and Infectious Disease Research date: 2010-12-24 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Bioinformatics is basically the study of informatic processes in biotic systems. Actually what constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. This chapter discusses the considerable progress in infectious diseases research that has been made in recent years using various “omics” case studies. Bioinformatics is tasked with making sense of it, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. This chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics, briefs metagenomics that finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms. 
This chapter explains the various possibilities of pan-genome, transcriptional reshaping and also enormous progress of proteomics study. Bioinformatic algorithms and tools are crucial tools in analyzing the data. The chapter also attempts to provide some details on the various problems and solution in bioinformatics that current-day scientists face while concentrating on second-generation sequencing strategies. url: https://api.elsevier.com/content/article/pii/B9780123848901000182 doi: 10.1016/b978-0-12-384890-1.00018-2 id: cord-264135-s2u76pvk author: Patel, Amrutlal K. title: Complete genome sequence analysis of chicken astrovirus isolate from India date: 2016-12-23 words: 3755.0 sentences: 217.0 pages: flesch: 49.0 cache: ./cache/cord-264135-s2u76pvk.txt txt: ./txt/cord-264135-s2u76pvk.txt summary: Phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2) . B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate A total of 9-10 epitopes were predicted using SVMTriP using the capsid protein sequence of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of the CAstV/ INDIA/ANAND/2016 nearest to that of CastV/4175 and CAstV/GA2011 and all four chicken astrovirus formed separate cluster except capsid protein of the CAstV/Poland/G059/ 2014 isolate which was clustered along with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 
4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with abstract: OBJECTIVE: Chicken astroviruses have been known to cause severe disease in chickens leading to increased mortality and “white chicks” condition. Here we aim to characterize the causative agent of visceral gout suspected for astrovirus infection in broiler breeder chickens. METHODS: Total RNA isolated from allantoic fluid of SPF embryo passaged with infected chicken sample was sequenced by whole genome shotgun sequencing using ion-torrent PGM platform. The sequence was analysed for the presence of coding and non-coding features, its similarity with reported isolates and epitope analysis of capsid structural protein. RESULTS: The consensus length of 7513 bp genome sequence of Indian isolate of chicken astrovirus was obtained after assembly of 14,121 high quality reads. The genome was comprised of 13 bp 5′-UTR, three open reading frames (ORFs) including ORF1a encoding serine protease, ORF1b encoding RNA dependent RNA polymerase (RdRp) and ORF2 encoding capsid protein, and 298 bp of 3′-UTR which harboured two corona virus stem loop II like “s2m” motifs and a poly A stretch of 19 nucleotides. The genetic analysis of CAstV/INDIA/ANAND/2016 suggested highest sequence similarity of 86.94% with the chicken astrovirus isolate CAstV/GA2011 followed by 84.76% with CAstV/4175 and 74.48% with CAstV/Poland/G059/2014 isolates. The capsid structural protein of CAstV/INDIA/ANAND/2016 showed 84.67% similarity with chicken astrovirus isolate CAstV/GA2011, 81.06% with CAstV/4175 and 41.18% with CAstV/Poland/G059/2014 isolates.
However, the capsid protein sequence showed high degree of sequence identity at nucleotide level (98.64-99.32%) and at amino acids level (97.74–98.69%) with reported sequences of Indian isolates suggesting their common origin and limited sequence divergence. The epitope analysis by SVMTriP identified two unique epitopes in our isolate, seven shared epitopes among Indian isolates and two shared epitopes among all isolates except Poland isolate which carried all distinct epitopes. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11259-016-9673-6) contains supplementary material, which is available to authorized users. url: https://www.ncbi.nlm.nih.gov/pubmed/28012117/ doi: 10.1007/s11259-016-9673-6 id: cord-341564-fvuwick5 author: Qi, Zhao-Hui title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application date: 2018-06-12 words: 2647.0 sentences: 178.0 pages: flesch: 54.0 cache: ./cache/cord-341564-fvuwick5.txt txt: ./txt/cord-341564-fvuwick5.txt summary: From these, we can see that physicochemical properties are widely applied with graphical representation of protein sequences by these researchers and their results seem well. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix.
An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids abstract: In this article, we propose a 3-dimensional graphical representation of protein sequences based on 10 physicochemical properties of 20 amino acids and the BLOSUM62 matrix. It contains evolutionary information and provides intuitive visualization. To further analyze the similarity of proteins, we extract a specific vector from the graphical representation curve. The vector is used to calculate the similarity distance between 2 protein sequences. To prove the effectiveness of our approach, we apply it to 3 real data sets. The results are consistent with the known evolution fact and show that our method is effective in phylogenetic analysis. url: https://www.ncbi.nlm.nih.gov/pubmed/29977111/ doi: 10.1177/1176934318777755 id: cord-321715-bkfkmtld author: Redelings, Benjamin D title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens date: 2007-03-14 words: 9793.0 sentences: 546.0 pages: flesch: 54.0 cache: ./cache/cord-321715-bkfkmtld.txt txt: ./txt/cord-321715-bkfkmtld.txt summary: To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t 1 , ..., t 2N -3 ) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. 
We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation. abstract: BACKGROUND: Phylogenies of rapidly evolving pathogens can be difficult to resolve because of the small number of substitutions that accumulate in the short times since divergence. To improve resolution of such phylogenies we propose using insertion and deletion (indel) information in addition to substitution information. We accomplish this through joint estimation of alignment and phylogeny in a Bayesian framework, drawing inference using Markov chain Monte Carlo. Joint estimation of alignment and phylogeny sidesteps biases that stem from conditioning on a single alignment by taking into account the ensemble of near-optimal alignments. RESULTS: We introduce a novel Markov chain transition kernel that improves computational efficiency by proposing non-local topology rearrangements and by block sampling alignment and topology parameters. In addition, we extend our previous indel model to increase biological realism by placing indels preferentially on longer branches. We demonstrate the ability of indel information to increase phylogenetic resolution in examples drawn from within-host viral sequence samples. We also demonstrate the importance of taking alignment uncertainty into account when using such information. Finally, we show that codon-based substitution models can significantly affect alignment quality and phylogenetic inference by unrealistically forcing indels to begin and end between codons. 
CONCLUSION: These results indicate that indel information can improve phylogenetic resolution of recently diverged pathogens and that alignment uncertainty should be considered in such analyses. url: https://www.ncbi.nlm.nih.gov/pubmed/17359539/ doi: 10.1186/1471-2148-7-40 id: cord-267500-x3u9i1vq author: Rose, Rebecca title: Challenges in the analysis of viral metagenomes date: 2016-08-03 words: 5928.0 sentences: 308.0 pages: flesch: 40.0 cache: ./cache/cord-267500-x3u9i1vq.txt txt: ./txt/cord-267500-x3u9i1vq.txt summary: Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of lowfrequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al. abstract: Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. 
Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis. 
url: https://www.ncbi.nlm.nih.gov/pubmed/29492275/ doi: 10.1093/ve/vew022 id: cord-300149-djclli8n author: Ruan, Yijun title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection date: 2003-05-24 words: 4355.0 sentences: 226.0 pages: flesch: 54.0 cache: ./cache/cord-300149-djclli8n.txt txt: ./txt/cord-300149-djclli8n.txt summary: title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677. abstract: BACKGROUND: The cause of severe acute respiratory syndrome (SARS) has been identified as a new coronavirus. Whole genome sequence analysis of various isolates might provide an indication of potential strain differences of this new virus. Moreover, mutation analysis will help to develop effective vaccines. 
METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). These sequences were compared with the isolates from Canada (TOR2), Hong Kong (CUHK-W1 and HKU39849), Hanoi (URBANI), Guangzhou (GZ01), and Beijing (BJ01, BJ02, BJ03, BJ04). FINDINGS: We identified 129 sequence variations among the 14 isolates, with 16 recurrent variant sequences. Common variant sequences at four loci define two distinct genotypes of the SARS virus. One genotype was linked with infections originating in Hotel M in Hong Kong, the second contained isolates from Hong Kong, Guangzhou, and Beijing with no association with Hotel M (p<0.0001). Moreover, other common sequence variants further distinguished the geographical origins of the isolates, especially between Singapore and Beijing. INTERPRETATION: Despite the recent onset of the SARS epidemic, genetic signatures are emerging that partition the worldwide SARS viral isolates into groups on the basis of contact source history and geography. These signatures can be used to trace sources of infection. In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. Published online May 9, 2003 http://image.thelancet.com/extras/03art4454web.pdf url: https://www.ncbi.nlm.nih.gov/pubmed/12781537/ doi: 10.1016/s0140-6736(03)13414-9 id: cord-015850-ef6svn8f author: Saitou, Naruya title: Eukaryote Genomes date: 2013-08-22 words: 7424.0 sentences: 484.0 pages: flesch: 53.0 cache: ./cache/cord-015850-ef6svn8f.txt txt: ./txt/cord-015850-ef6svn8f.txt summary: General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. 
We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81]. abstract: General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed.
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7119937/ doi: 10.1007/978-1-4471-5304-7_8 id: cord-264296-0x90yubt author: Sawmya, Shashata title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond date: 2020-06-03 words: 5008.0 sentences: 312.0 pages: flesch: 60.0 cache: ./cache/cord-264296-0x90yubt.txt txt: ./txt/cord-264296-0x90yubt.txt summary: We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the sars-cov-2 genome sequence followed by further analyses of the same on several South-Asian countries. D. Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome. abstract: Covid-19 pandemic, caused by the sars-cov-2 strain of coronavirus, has affected millions of people all over the world and taken thousands of lives. It is of utmost importance that the character of this deadly virus be studied and its nature be analysed. 
We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. In a nutshell, we have prepared an analysis pipeline for hCov genome sequences leveraging the power of machine intelligence and uncovered what remained apparently shrouded by raw data. url: https://doi.org/10.1101/2020.06.03.131987 doi: 10.1101/2020.06.03.131987 id: cord-268467-btfz6ye8 author: Schreiber, Steven S. title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E date: 1989-03-31 words: 5035.0 sentences: 343.0 pages: flesch: 59.0 cache: ./cache/cord-268467-btfz6ye8.txt txt: ./txt/cord-268467-btfz6ye8.txt summary: The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3′-end of the genomic RNA or the leader sequence. The 3′-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3′-end (Fig. 4) which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et al., 1983; Lapps et al., 1987; Kamahora et al., 1988; Boursnell et al., 1985) (Table 1).
Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3′ end of the viral mRNA leader sequence abstract: Human coronaviruses are important human pathogens and have also been implicated in multiple sclerosis. To further understand the molecular biology of human coronavirus 229E (HCV-229E), molecular cloning and sequence analysis of the viral RNA have been initiated. Following established protocols, the 3′-terminal 1732 nucleotides of the genome were sequenced. A large open reading frame encodes a 389 amino acid protein of 43,366 Da, which is presumably the nucleocapsid protein. The predicted protein is similar in size, chemical properties, and amino acid sequence to the nucleocapsid proteins of other coronaviruses. This is especially evident when the sequence is compared with that of the antigenically related porcine transmissible gastroenteritis virus (TGEV), with which a region of 46% amino acid sequence homology was found. Hydropathy profiles revealed the existence of several conserved domains which could have functional significance. An intergenic consensus sequence precedes the 5′-end of the proposed nucleocapsid protein gene. The consensus sequence is present in other coronaviruses and has been proposed as the site of binding of the leader sequence for mRNA transcriptional start. This region was also examined by primer extension analysis of mRNAs, which identified a 60-nucleotide leader sequence. The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA.
url: https://api.elsevier.com/content/article/pii/0042682289900500 doi: 10.1016/0042-6822(89)90050-0 id: cord-010273-0c56x9f5 author: Simmonds, Peter title: Virology of hepatitis C virus date: 2001-10-10 words: 7897.0 sentences: 337.0 pages: flesch: 41.0 cache: ./cache/cord-010273-0c56x9f5.txt txt: ./txt/cord-010273-0c56x9f5.txt summary: 1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complementary to the HCV genome. 6 ''13 Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5′NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients. 115 Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection. abstract: Hepatitis C virus (HCV) has been identified as the main causative agent of post-transfusion non-A, non-B hepatitis. Through recently developed diagnostic assays, routine serologic screening of blood donors has prevented most cases of post-transfusion hepatitis. The purpose of this paper is to comprehensively review current information regarding the virology of HCV.
Recent findings on the genome organization, its relationship to other viruses, the replication of HCV ribonucleic acid, HCV translation, and HCV polyprotein expression and processing are discussed. Also reviewed are virus assembly and release, the variability of HCV and its classification into genotypes, the geographic distribution of HCV genotypes, and the biologic differences between HCV genotypes. The assays used in HCV genotyping are discussed in terms of reliability and consistency of results, and the molecular epidemiology of HCV infection is reviewed. These approaches to HCV epidemiology will prove valuable in documenting the spread of HCV in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and in understanding more about the origins and evolution of HCV. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173289/ doi: 10.1016/s0149-2918(96)80193-7 id: cord-213136-euv6pqh5 author: Singh, Kulveer title: Sequence Effects on Internal Structure of Droplets of Associative Polymers date: 2020-05-17 words: 4329.0 sentences: 184.0 pages: flesch: 56.0 cache: ./cache/cord-213136-euv6pqh5.txt txt: ./txt/cord-213136-euv6pqh5.txt summary: We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. 
For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers). abstract: We used Langevin dynamics simulations of short associative polymers with two stickers placed symmetrically along their contour to study the effect of the primary sequence of these polymers on their organization inside condensed droplets. We observed that the shape, size and number of sticker clusters inside the condensed droplet change from a single cylindrical fiber to many compact clusters, as one varies the location of stickers along the chain contour. Aging due to conversion of intramolecular to intermolecular associations was observed in droplets of telechelic polymers, but not for other sequences of associating polymers. The relevance of our results to condensates of intrinsically disordered proteins is discussed. url: https://arxiv.org/pdf/2005.08246v1.pdf doi: nan id: cord-022348-w7z97wir author: Sola, Monica title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing? date: 2007-09-02 words: 10892.0 sentences: 671.0 pages: flesch: 56.0 cache: ./cache/cord-022348-w7z97wir.txt txt: ./txt/cord-022348-w7z97wir.txt summary: An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV.
Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6 .1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6 .1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity. abstract: This chapter argues that the vast majority of genetic changes or mutations fixed by RNA viruses are essentially neutral or nearly neutral in character. In molecular evolution one of the remarkable observations has been the uniformity of the molecular clock. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. These analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. Synonymous changes are invariably more frequent than nonsynonymous changes. Positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. Although evolution has connotations of change, what has always counted is natural selection or adaptation. It is the only force for the genesis of a novel replicon. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7155598/ doi: 10.1016/b978-012220360-2/50007-6 id: cord-266960-kyx6xhvj author: Temple, Mark D. 
title: Real-time audio and visual display of the Coronavirus genome date: 2020-10-02 words: 6780.0 sentences: 360.0 pages: flesch: 56.0 cache: ./cache/cord-266960-kyx6xhvj.txt txt: ./txt/cord-266960-kyx6xhvj.txt summary: The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORFs; these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example "Sonification Sub-genomic RNA" the auditory display represents the process of transcription. abstract: BACKGROUND: This paper describes a web based tool that uses a combination of sonification and an animated display to inquire into the SARS-CoV-2 genome. The audio data is generated in real time from a variety of RNA motifs that are known to be important in the functioning of RNA. Additionally, metadata relating to RNA translation and transcription has been used to shape the auditory and visual displays. Together these tools provide a unique approach to further understand the metabolism of the viral RNA genome. This audio provides a further means to represent the function of the RNA in addition to traditional written and visual approaches.
RESULTS: Sonification of the SARS-CoV-2 genomic RNA sequence results in a complex auditory stream composed of up to 12 individual audio tracks. Each auditory motive is derived from the actual RNA sequence or from metadata. This approach has been used to represent transcription or translation of the viral RNA genome. The display highlights the real-time interaction of functional RNA elements. The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. Functional RNA motifs such as transcription regulatory sequences and stem loop regions have also been sonified. Using the tool, audio can be generated in real-time from either genomic or sub-genomic representations of the RNA. Given the large size of the viral genome, a collection of interactive buttons has been provided to navigate to regions of interest, such as cleavage regions in the polyprotein, untranslated regions or each gene. These tools are available through an internet browser and the user can interact with the data display in real time. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Furthermore, the tool has been used as an algorithmic based audio generator. These audio tracks can be listened to by the general community without reference to the visual display to encourage further inquiry into the science. url: https://doi.org/10.1186/s12859-020-03760-7 doi: 10.1186/s12859-020-03760-7 id: cord-300807-9u8idlon author: Tong, Joo Chuan title: 7 Infectious disease informatics date: 2013-12-31 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Abstract: Throughout history, infectious diseases have posed a serious burden to mankind. More recently, there has been an alarming increase in drug-resistant microbes. 
Furthermore, new pathogens are emerging due to microbial evolution and adaptation. The spread of these diseases is a result of pathogen mutations and changes in human behavior patterns. Then, there are diseases that are lurking in the background, waiting for the right conditions before they strike again. In the war against these diseases, we have come to understand the behaviors of microbes in a heterogeneous world and the mechanisms governing disease transmission. These works have profoundly shaped modern knowledge of emerging and re-emerging infections. More recently, computational techniques have led the way into this new era by allowing rapid high-throughput analysis of pathogens which was previously not possible using traditional laboratory techniques. This chapter introduces methods in mathematical modeling, computational biology, and bioinformatics that have been used to study infectious diseases. url: https://api.elsevier.com/content/article/pii/B9781907568411500076 doi: 10.1533/9781908818416.99 id: cord-254942-g51mjj2b author: Touati, Rabeb title: New methodology for repetitive sequences identification in human X and Y chromosomes date: 2020-10-19 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Repetitive DNA sequences occupy the major proportion of DNA in the human genome and even in the other species’ genomes. The importance of each repetitive DNA type depends on many factors: structural and functional roles, positions, lengths and numbers of these repetitions are clear examples. Conserving such DNA sequences or not in different locations in the chromosome remains a challenge for researchers in biology. Detecting their location despite their great variability and finding novel repetitive sequences remains a challenging task. To side-step this problem, we developed a new method based on signal and image processing tools. 
In fact, using this method we could find repetitive patterns in DNA images regardless of the repetition length. This new technique seems to be more efficient in detecting new repetitive sequences than bioinformatics tools. In fact, the classical tools show limited performance, especially in the case of mutations (insertions or deletions). However, modifying one or a few pixels in the image doesn’t affect the global form of the repetitive pattern. As a consequence, we generated a new database of repetitive patterns that contains tandem and dispersed repeated sequences. The highly repetitive sequences we have identified in the X and Y chromosomes are shown to be located in other human chromosomes or in other genomes. The data we have generated are then used as input to a convolutional neural network classifier. The system we have constructed is efficient and achieves an average recognition score of 94.4%. url: https://www.ncbi.nlm.nih.gov/pubmed/33101452/ doi: 10.1016/j.bspc.2020.102207 id: cord-301827-a7hnuxy5 author: Uversky, Vladimir N title: A decade and a half of protein intrinsic disorder: Biology still waits for physics date: 2013-04-29 words: 20971.0 sentences: 1059.0 pages: flesch: 43.0 cache: ./cache/cord-301827-a7hnuxy5.txt txt: ./txt/cord-301827-a7hnuxy5.txt summary: 94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. 
In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns. abstract: The abundant existence of proteins and regions that possess specific functions without being uniquely folded into unique 3D structures has become accepted by a significant number of protein scientists. Sequences of these intrinsically disordered proteins (IDPs) and IDP regions (IDPRs) are characterized by a number of specific features, such as low overall hydrophobicity and high net charge which makes these proteins predictable. IDPs/IDPRs possess large hydrodynamic volumes, low contents of ordered secondary structure, and are characterized by high structural heterogeneity. They are very flexible, but some may undergo disorder to order transitions in the presence of natural ligands. The degree of these structural rearrangements varies over a very wide range. IDPs/IDPRs are tightly controlled under the normal conditions and have numerous specific functions that complement functions of ordered proteins and domains. When lacking proper control, they have multiple roles in pathogenesis of various human diseases. 
Gaining structural and functional information about these proteins is a challenge, since they do not typically “freeze” while their “pictures are taken.” However, despite or perhaps because of the experimental challenges, these fuzzy objects with fuzzy structures and fuzzy functions are among the most interesting targets for modern protein research. This review briefly summarizes some of the recent advances in this exciting field and considers some of the basic lessons learned from the analysis of physics, chemistry, and biology of IDPs. url: https://doi.org/10.1002/pro.2261 doi: 10.1002/pro.2261 id: cord-339209-oe8onyr9 author: Vasilakis, Nikos title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date: 2014-05-20 words: 5817.0 sentences: 272.0 pages: flesch: 46.0 cache: ./cache/cord-339209-oe8onyr9.txt txt: ./txt/cord-339209-oe8onyr9.txt summary: The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5'-untranslated region (5'-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail (Figure 2). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CLpro, RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP, TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-terminal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain (Figure 4A, 4D). 
abstract: BACKGROUND: The family Mesoniviridae (order Nidovirales) comprises a group of positive-sense, single-stranded RNA ([+]ssRNA) viruses isolated from mosquitoes. FINDINGS: Thirteen novel insect-specific virus isolates were obtained from mosquitoes collected in Indonesia, Thailand and the USA. By electron microscopy, the virions appeared as spherical particles with a diameter of ~50 nm. Their 20,129 nt to 20,777 nt genomes consist of positive-sense, single-stranded RNA with a poly-A tail. Four isolates from Houston, Texas, and one isolate from Java, Indonesia, were identified as variants of the species Alphamesonivirus-1 which also includes Nam Dinh virus (NDiV) from Vietnam and Cavally virus (CavV) from Côte d’Ivoire. The eight other isolates were identified as variants of three new mesoniviruses, based on genome organization and pairwise evolutionary distances: Karang Sari virus (KSaV) from Java, Bontag Baru virus (BBaV) from Java and Kalimantan, and Kamphaeng Phet virus (KPhV) from Thailand. In comparison with NDiV, the three new mesoniviruses each contained a long insertion (180 – 588 nt) of unknown function in the 5’ region of ORF1a, which accounted for much of the difference in genome size. The insertions contained various short imperfect repeats and may have arisen by recombination or sequence duplication. CONCLUSIONS: In summary, based on their genome organizations and phylogenetic relationships, thirteen new viruses were identified as members of the family Mesoniviridae, order Nidovirales. Species demarcation criteria employed previously for mesoniviruses would place five of these isolates in the same species as NDiV and CavV (Alphamesonivirus-1) and the other eight isolates would represent three new mesonivirus species (Alphamesonivirus-5, Alphamesonivirus-6 and Alphamesonivirus-7). 
The observed spatiotemporal distribution over widespread geographic regions and broad species host range in mosquitoes suggests that mesoniviruses may be common in mosquito populations worldwide. url: https://doi.org/10.1186/1743-422x-11-97 doi: 10.1186/1743-422x-11-97 id: cord-296691-cg463fbn author: Wang, Ren title: De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing date: 2013-04-09 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: BACKGROUND: Lycoris aurea, also called Golden Magic Lily, is an ornamentally and medicinally important species of the Amaryllidaceae family. Because L. aurea is a non-model organism, no whole-genome sequence is available to date. Transcriptomic information is also scarce for this species. In this study, we performed de novo transcriptome sequencing to produce the first comprehensive expressed sequence tag (EST) dataset for L. aurea using high-throughput sequencing technology. METHODOLOGY AND PRINCIPAL FINDINGS: Total RNA was isolated from leaves with sodium nitroprusside (SNP), salicylic acid (SA), or methyl jasmonate (MeJA) treatment, stems, and flowers at the bud, blooming, and wilting stages. Equal quantities of RNA from each tissue and stage were pooled to construct a cDNA library. Using 454 pyrosequencing technology, a total of 937,990 high quality reads (308.63 Mb) with an average read length of 329 bp were generated. Clustering and assembly of these reads produced a non-redundant set of 141,111 unique sequences, comprising 24,604 contigs and 116,507 singletons. All of the unique sequences were assigned to the biological process, cellular component and molecular function categories by GO analysis. Potential genes and their functions were predicted by KEGG pathway mapping and COG analysis. 
Based on our sequence analysis and the published literature, many putative genes involved in Amaryllidaceae alkaloid synthesis, including PAL, TYDC, OMT, NMT, P450, and other potentially important candidate genes, were identified for the first time in this Lycoris species. Furthermore, 6,386 SSRs and 18,107 high-confidence SNPs were identified in this EST dataset. CONCLUSIONS: The transcriptome provides invaluable new data as a functional genomics resource for future biological research in L. aurea. The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will provide useful information for functional genomic research in the future. url: https://www.ncbi.nlm.nih.gov/pubmed/23593220/ doi: 10.1371/journal.pone.0060449 id: cord-324216-ce3wa889 author: Wang, Zheng title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date: 2008-12-01 words: 5206.0 sentences: 240.0 pages: flesch: 49.0 cache: ./cache/cord-324216-ce3wa889.txt txt: ./txt/cord-324216-ce3wa889.txt summary: Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17]. This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17], developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. 
A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets, in conjunction with the previously developed sequence analysis algorithm CIBSI, can be easily interpreted to make serotype or strain identifications. abstract: BACKGROUND: Febrile respiratory illness (FRI) has a high impact on public health and global economics and poses a difficult challenge for differential diagnosis. A particular issue is the detection of genetically diverse pathogens, i.e. human rhinoviruses (HRV) and enteroviruses (HEV) which are frequent causes of FRI. Resequencing Pathogen Microarray technology has demonstrated potential for differential diagnosis of several respiratory pathogens simultaneously, but a high-confidence design method to select probes for genetically diverse viruses is lacking. RESULTS: Using HRV and HEV as test cases, we assess a general design strategy for detecting and serotyping genetically diverse viruses. A minimal number of probe sequences (26 for HRV and 13 for HEV), which were potentially capable of detecting all serotypes of HRV and HEV, were determined and implemented on the Resequencing Pathogen Microarray RPM-Flu v.30/31 (Tessarae RPM-Flu). The specificities of designed probes were validated using 34 HRV and 28 HEV strains. All strains were successfully detected and identified at least to species level. 33 HRV strains and 16 HEV strains could be further differentiated to serotype level. CONCLUSION: This study provides a fundamental evaluation of simultaneous detection and differential identification of genetically diverse RNA viruses with a minimal number of prototype sequences. The results demonstrated that the newly designed RPM-Flu v.30/31 can provide comprehensive and specific analysis of HRV and HEV samples, which implies that this design strategy will be applicable to other genetically diverse viruses. 
url: https://www.ncbi.nlm.nih.gov/pubmed/19046445/ doi: 10.1186/1471-2164-9-577 id: cord-022494-d66rz6dc author: Webb, B. title: Comparative Modeling of Drug Target Proteins date: 2014-10-01 words: 8782.0 sentences: 453.0 pages: flesch: 47.0 cache: ./cache/cord-022494-d66rz6dc.txt txt: ./txt/cord-022494-d66rz6dc.txt summary: Comparative modeling consists of four main steps 23 (Figure 2(a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (Figure 2(b)). abstract: In this perspective, we begin by describing the comparative protein structure modeling technique and the accuracy of the corresponding models. We then discuss the significant role that comparative prediction plays in drug discovery. We focus on virtual ligand screening against comparative models and illustrate the state-of-the-art by a number of specific examples. 
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157477/ doi: 10.1016/b978-0-12-409547-2.11133-3 id: cord-311839-61djk4bs author: Wei, Dan title: A novel hierarchical clustering algorithm for gene sequences date: 2012-07-23 words: 8033.0 sentences: 496.0 pages: flesch: 61.0 cache: ./cache/cord-311839-61djk4bs.txt txt: ./txt/cord-311839-61djk4bs.txt summary: We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1. abstract: BACKGROUND: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into feature vectors which contain the occurrence, location and order relation of k-tuples in the DNA sequence. Afterwards, a hierarchical procedure is applied to cluster the DNA sequences based on the feature vectors. RESULTS: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. 
The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. CONCLUSIONS: We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences. url: https://doi.org/10.1186/1471-2105-13-174 doi: 10.1186/1471-2105-13-174 id: cord-343863-q1y8uscj author: Whitney, Joe title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches date: 2005-02-08 words: 3463.0 sentences: 179.0 pages: flesch: 61.0 cache: ./cache/cord-343863-q1y8uscj.txt txt: ./txt/cord-343863-q1y8uscj.txt summary: ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components (Figure 1): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; and (4) a back-end Java program which runs alignment programs and compiles results in the database. 
abstract: BACKGROUND: Sequence similarity searching is a powerful tool to help develop hypotheses in the quest to assign functional, structural and evolutionary information to DNA and protein sequences. As sequence databases continue to grow exponentially, it becomes increasingly important to repeat searches at frequent intervals, and similarity searches retrieve larger and larger sets of results. New and potentially significant results may be buried in a long list of previously obtained sequence hits from past searches. RESULTS: ReHAB (Recent Hits Acquired from BLAST) is a tool for finding new protein hits in repeated PSI-BLAST searches. ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. Results are presented in an easily comprehended table, or in a BLAST-like report, using colors to highlight the new hits. ReHAB is designed to handle large numbers of query sequences, such as whole genomes or sets of genomes. Advanced computer skills are not needed to use ReHAB; the graphics interface is simple to use and was designed with the bench biologist in mind. CONCLUSIONS: This software greatly simplifies the problem of evaluating the output of large numbers of protein database searches. 
url: https://www.ncbi.nlm.nih.gov/pubmed/15701178/ doi: 10.1186/1471-2105-6-23 id: cord-103029-nc5yf6x4 author: Wichmann, Stefan title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank date: 2020-09-25 words: 8665.0 sentences: 387.0 pages: flesch: 52.0 cache: ./cache/cord-103029-nc5yf6x4.txt txt: ./txt/cord-103029-nc5yf6x4.txt summary: In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occurring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here. abstract: Overlapping genes (OLGs) with long protein-coding overlapping sequences are often excluded by genome annotation programs, with the exception of virus genomes. A recent study used a novel algorithm to construct OLGs from arbitrary protein domain pairs and concluded that virus genes are best suited for creating OLGs, a result which fitted with common assumptions. However, improving sequence evaluation using Hidden Markov Models shows that the previous result is an artifact originating from dataset-database biases. 
When parameters for OLG design and evaluation are optimized we find that 94.5% of the constructed OLG pairs score at least as highly as naturally occurring sequences, while 9.6% of the artificial OLGs cannot be distinguished from typical sequences in their protein family. Constructed OLG sequences are also indistinguishable from natural sequences in terms of amino acid identity and secondary structure, while the minimum nucleotide change required for overprinting an overlapping sequence can be as low as 1.8% of the sequence. Separate analysis of datasets containing only sequences from either archaea, bacteria, eukaryotes or viruses showed that, surprisingly, virus genes are much less suitable for designing OLGs than bacterial or eukaryotic genes. An important factor influencing OLG design is the structure of the standard genetic code. Success rates in different reading frames strongly correlate with their code-determined respective amino acid constraints. There is a tendency indicating that the structure of the standard genetic code could be optimized in its ability to create OLGs while conserving mutational robustness. The findings reported here add to the growing evidence that OLGs should no longer be excluded in prokaryotic genome annotations. Determining the factors facilitating the computational design of artificial overlapping genes may improve our understanding of the origin of these remarkable genetic constructs and may also open up exciting possibilities for synthetic biology. 
url: https://doi.org/10.1101/2020.09.25.312959 doi: 10.1101/2020.09.25.312959 id: cord-103297-4stnx8dw author: Widrich, Michael title: Modern Hopfield Networks and Attention for Immune Repertoire Classification date: 2020-08-17 words: 14093.0 sentences: 926.0 pages: flesch: 57.0 cache: ./cache/cord-103297-4stnx8dw.txt txt: ./txt/cord-103297-4stnx8dw.txt summary: In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield networks (Krotov & Hopfield, 2016; Demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results"). Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule abstract: A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. 
We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. Immune repertoire classification based on the vast number of immunosequences of an individual is a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC url: https://doi.org/10.1101/2020.04.12.038158 doi: 10.1101/2020.04.12.038158 id: cord-253436-dz84icdc author: Wille, Michelle title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl date: 2016-03-03 words: 2019.0 sentences: 103.0 pages: flesch: 54.0 cache: ./cache/cord-253436-dz84icdc.txt txt: ./txt/cord-253436-dz84icdc.txt summary: In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. 
Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the same species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small sample sizes and differences in prevalence, what is clear is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks, are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30], and little host species and temporal structuring within waterfowl-derived viruses in the conserved polymerase genes (such as PB2, PB1) [31]. abstract: Coronaviruses (CoVs) are found in a wide variety of wild and domestic animals, and constitute a risk for zoonotic and emerging infectious disease. In poultry, the genetic diversity, evolution, distribution and taxonomy of some coronaviruses have been well described, but little is known about the features of CoVs in wild birds. In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. The highest prevalence was found in the diving ducks, mainly Greater Scaup (Aythya marila; 51.5%), and the dabbling duck Mallard (Anas platyrhynchos; 19.2%). Sequences from two of the Greater Scaup CoV fell into an infrequently detected lineage, shared only with a Tufted Duck (Aythya fuligula) CoV. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the same species and location in 2011, suggesting long-term maintenance in this population. A single Black-headed Gull represented the only positive sample from the order Charadriiformes. 
Globally, Anas species represent the largest fraction of avian CoV sequences, and there seems to be no host species, geographical or temporal structure. To better understand the etiology, epidemiology and ecology of these viruses, more systematic surveillance of wild birds and subsequent sequencing of detected CoV is imperative. url: https://doi.org/10.1371/journal.pone.0150198 doi: 10.1371/journal.pone.0150198 id: cord-280881-5o38ihe0 author: Wlodawer, Alexander title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date: 2003-11-11 words: 4862.0 sentences: 220.0 pages: flesch: 51.0 cache: ./cache/cord-280881-5o38ihe0.txt txt: ./txt/cord-280881-5o38ihe0.txt summary: These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8]. We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences (Figure 1). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin (Figure 4), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2. abstract: BACKGROUND: Tripeptidyl-peptidase I, also known as CLN2, is a member of the family of sedolisins (serine-carboxyl peptidases). In humans, defects in expression of this enzyme lead to a fatal neurodegenerative disease, classical late-infantile neuronal ceroid lipofuscinosis. 
Similar enzymes have been found in the genomic sequences of several species, but neither systematic analyses of their distribution nor modeling of their structures have been previously attempted. RESULTS: We have analyzed the presence of orthologs of human CLN2 in the genomic sequences of a number of eukaryotic species. Enzymes with sequences sharing over 80% identity have been found in the genomes of macaque, mouse, rat, dog, and cow. Closely related, although clearly distinct, enzymes are present in fish (fugu and zebra), as well as in frogs (Xenopus tropicalis). A three-dimensional model of human CLN2 was built based mainly on the homology with Pseudomonas sp. 101 sedolisin. CONCLUSION: CLN2 is very highly conserved and widely distributed among higher organisms and may play an important role in their life cycles. The model presented here indicates a very open and accessible active site that is almost completely conserved among all known CLN2 enzymes. This result is somehow surprising for a tripeptidase where the presence of a more constrained binding pocket was anticipated. This structural model should be useful in the search for the physiological substrates of these enzymes and in the design of more specific inhibitors of CLN2. url: https://www.ncbi.nlm.nih.gov/pubmed/14609438/ doi: 10.1186/1472-6807-3-8 id: cord-018963-2lia97db author: Xu, Ying title: Protein Structure Prediction by Protein Threading date: 2010-04-29 words: 15309.0 sentences: 716.0 pages: flesch: 48.0 cache: ./cache/cord-018963-2lia97db.txt txt: ./txt/cord-018963-2lia97db.txt summary: Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." 
These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now. abstract: The seminal work of Bowie, Lüthy, and Eisenberg (Bowie et al., 1991) on “the inverse protein folding problem” laid the foundation of protein structure prediction by protein threading. By using simple measures for fitness of different amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold. Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term “protein threading.” These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now. 
url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123984/ doi: 10.1007/978-0-387-68825-1_1 id: cord-010499-yefxrj30 author: Yelverton, Elizabeth title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli date: 2006-10-27 words: 5883.0 sentences: 330.0 pages: flesch: 60.0 cache: ./cache/cord-010499-yefxrj30.txt txt: ./txt/cord-010499-yefxrj30.txt summary: Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain ''hungry'' codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gallant, 1983; 1986; Gallant et al., 1985; Kurland and Gallant, 1986). Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsley, 1992; Peter et al. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site. abstract: A 15‐17 nucleotide sequence from the gag‐pol ribosome frameshift site of HIV‐1 directs analogous ribosomal frameshifting in Escherichia coli. Limitation for leucine, which is encoded precisely at the frameshift site, dramatically increased the frequency of leftward frameshifting. Limitation for phenylalanine or arginine, which are encoded just before and just after the frameshift, did not significantly affect frameshifting. Protein sequence analysis demonstrated the occurrence of two closely related frameshift mechanisms. In the first, ribosomes appear to bind leucyl‐tRNA at the frameshift site and then slip leftward. This is the ‘simultaneous slippage’ mechanism. 
In the second, ribosomes appear to slip before binding aminoacyl‐tRNA, and then bind phenylalanyl‐tRNA, which is encoded in the left‐shifted reading frame. This mechanism is identical to the ‘overlapping reading’ we have demonstrated at other bacterial frameshift sites. The HIV‐1 sequence is prone to frame‐shifting by both mechanisms in E. coli. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7192232/ doi: 10.1111/j.1365-2958.1994.tb00310.x id: cord-005060-n901y2d4 author: ZHANG, Feiyun title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus date: 2001 words: 2602.0 sentences: 173.0 pages: flesch: 62.0 cache: ./cache/cord-005060-n901y2d4.txt txt: ./txt/cord-005060-n901y2d4.txt summary: The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobemoviruses has been determined in Southern bean mosaic virus (SBMV)''2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the conserved sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region : proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3). In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8). abstract: The genome of Ryegrass mottle virus (RGMoV) comprises 4210 nucleotides. The genomic RNA contains four open reading frames (ORFs). 
The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The viral coat protein is encoded on ORF 4 present at the 3′-proximal region. Other ORFs 1 and 3 encode the predicted 14.6 kDa and 19.8 kDa proteins of unknown function. The consensus signal for frameshifting, heptanucleotide UUUAAAC and a stem-loop structure just downstream is in front of the AUG codon of ORF 3. Analysis of the in vitro translation products of RGMoV RNA suggests that the 68 kDa protein may represent a fusion protein of ORF 2-ORF 3 produced by frameshifting. The protease region of the polyprotein and coat protein have a low similarity with that of the sobemoviruses (approximately 25% amino acid identity), while the RNA-dependent RNA polymerase region has particularly strong similarity (54 to 60% of more than 350 amino acid residues). The sequence similarities of RGMoV to the sobemoviruses, together with the characteristic genome organization indicate that RGMoV is a new species of the genus Sobemovirus. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7088213/ doi: 10.1007/pl00012989 id: cord-340907-j9i1wlak author: Zarai, Yoram title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts date: 2020-04-27 words: 8162.0 sentences: 415.0 pages: flesch: 45.0 cache: ./cache/cord-340907-j9i1wlak.txt txt: ./txt/cord-340907-j9i1wlak.txt summary: Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depicts the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in few subsets of viruses in both the original and random variants of the virus. 
A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses. abstract: Viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, while evading its immune system. Understanding how viruses co-evolve with their hosts is a fundamental topic in molecular evolution and may also aid in developing novel viral based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. These sequences cannot be explained by the coding regions’ amino acid content, codon, and dinucleotide frequencies. We specifically show that short homooligonucleotide and palindromic sequences tend to be under-represented in many viruses probably due to their effect on gene expression regulation and the interaction with the host immune system. In addition, we show that more sequences tend to be under-represented in dsDNA viruses than in other viral groups. Finally, we demonstrate, based on in vitro and in vivo experiments, how under-represented sequences can be used to attenuate Zika virus strains. 
url: https://www.ncbi.nlm.nih.gov/pubmed/32339222/ doi: 10.1093/dnares/dsaa008 id: cord-266794-oyppubq5 author: Zhang, Dachuan title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model date: 2020-09-01 words: 1003.0 sentences: 75.0 pages: flesch: 48.0 cache: ./cache/cord-266794-oyppubq5.txt txt: ./txt/cord-266794-oyppubq5.txt summary: title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020) , a genome browser (Ham, et al., 2012) , and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species. abstract: MOTIVATION: The 2019 novel coronavirus outbreak has significantly affected global health and society. Thus, predicting biological function from pathogen sequence is crucial and urgently needed. However, little work has been performed to identify viruses by the enzymes that they encode, and which are key to pathogen propagation. RESULTS: We built a comprehensive scientific resource, SARS2020, that integrates coronavirus-related research, genomic sequences, and results of anti-viral drug trials. 
In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. This data-driven sequence-based strategy will enable rapid identification of agents responsible for future epidemics. AVAILABILITY: SARS2020 is available at http://design.rxnfinder.org/sars2020/. SUPPLEMENTARY INFORMATION: url: https://www.ncbi.nlm.nih.gov/pubmed/32871007/ doi: 10.1093/bioinformatics/btaa767 id: cord-344782-ond1ziu5 author: Zhang, Jing title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi) date: 2018-10-24 words: 6003.0 sentences: 280.0 pages: flesch: 49.0 cache: ./cache/cord-344782-ond1ziu5.txt txt: ./txt/cord-344782-ond1ziu5.txt summary: Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists and samples collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue-two of the main affected organs. 
abstract: In mid-February 2015, a large number of deaths were observed in the sole extant population of an endangered species of freshwater snapping turtle, Myuchelys georgesi, in a coastal river in New South Wales, Australia. Mortalities continued for approximately 7 weeks and affected mostly adult animals. More than 400 dead or dying animals were observed and population surveys conducted after the outbreak had ceased indicated that only a very small proportion of the population had survived, severely threatening the viability of the wild population. At necropsy, animals were in poor body condition, had bilateral swollen eyelids and some animals had tan foci on the skin of the ventral thighs. Histological examination revealed peri-orbital, splenic and nephric inflammation and necrosis. A virus was isolated in cell culture from a range of tissues. Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Its closest relatives are nidoviruses that have recently been described in pythons and lizards, usually in association with respiratory disease. In contrast, in the affected turtles, the most significant pathological changes were in the kidneys. Real time PCR assays developed to detect this virus demonstrated very high virus loads in affected tissues. In situ hybridisation studies confirmed the presence of viral nucleic acid in tissues in association with pathological changes. Collectively these data suggest that this virus is the likely cause of the mortalities that now threaten the survival of this species. Bellinger River Virus is the name proposed for this new virus. 
url: https://doi.org/10.1371/journal.pone.0205209 doi: 10.1371/journal.pone.0205209 id: cord-193910-7p3f3znj author: Zhang, Xiangxie title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification date: 2020-11-01 words: 7724.0 sentences: 436.0 pages: flesch: 59.0 cache: ./cache/cord-193910-7p3f3znj.txt txt: ./txt/cord-193910-7p3f3znj.txt summary: In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a con-volutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table ( 8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken. abstract: The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. 
We also use an existing feature extraction method based on 3-grams to represent amino acids and combine both feature extraction methods with a multitude of machine learning algorithms. Four different data sets, each concerning viral diseases such as Covid-19, AIDS, Influenza, and Hepatitis C, are used for evaluating the different approaches. The results of the experiments show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method leads in general to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest Covid-19 dataset url: https://arxiv.org/pdf/2011.00485v1.pdf doi: nan id: cord-031957-df4luh5v author: dos Santos-Silva, Carlos André title: Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era date: 2020-09-02 words: nan sentences: nan pages: flesch: nan cache: txt: summary: abstract: Even before the perception or interaction with pathogens, plants rely on constitutively guardian molecules, often specific to tissue or stage, with further expression after contact with the pathogen. These guardians include small molecules as antimicrobial peptides (AMPs), generally cysteine-rich, functioning to prevent pathogen establishment. Some of these AMPs are shared among eukaryotes (eg, defensins and cyclotides), others are plant specific (eg, snakins), while some are specific to certain plant families (such as heveins). When compared with other organisms, plants tend to present a higher amount of AMP isoforms due to gene duplications or polyploidy, an occurrence possibly also associated with the sessile habit of plants, which prevents them from evading biotic and environmental stresses. Therefore, plants arise as a rich resource for new AMPs. 
As these molecules are difficult to retrieve from databases using simple sequence alignments, a description of their characteristics and in silico (bioinformatics) approaches used to retrieve them is provided, considering resources and databases available. The possibilities and applications based on tools versus database approaches are considerable and have been so far underestimated. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7476358/ doi: 10.1177/1177932220952739 id: cord-001835-0s7ok4uw author: nan title: Abstracts of the 29th Annual Symposium of The Protein Society date: 2015-10-01 words: 138514.0 sentences: 6150.0 pages: flesch: 40.0 cache: ./cache/cord-001835-0s7ok4uw.txt txt: ./txt/cord-001835-0s7ok4uw.txt summary: Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural aminoacids via the MIO-enzyme toolkit Alina Filip 1, Judith H Bartha-Vári 1, Gergely Bánóczy 2, László Poppe 2, Csaba Paizs 1, Florin-Dan Irimie 1 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology An attractive enzymatic route to enantiomerically pure to the highly valuable α- or β-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues. 
abstract: nan url: https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/pro.2823 doi: 10.1002/pro.2823 id: cord-004879-pgyzluwp author: nan title: Programmed cell death date: 1994 words: 81677.0 sentences: 4465.0 pages: flesch: 51.0 cache: ./cache/cord-004879-pgyzluwp.txt txt: ./txt/cord-004879-pgyzluwp.txt summary: Furthermore kinetic experiments after complementation of HIV=RT p66 with KIV-RT pSl indicated that HIV-RT pSl can restore rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer D66/D51, suggesting a function of the 51 kDa polypeptide, The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3'' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER), We could show that short-term activation (30 mins.) of c-FosER by estradiole (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins. abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087532/ doi: 10.1007/bf02033112 id: cord-014462-11ggaqf1 author: nan title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh date: 2011-04-21 words: 35453.0 sentences: 1711.0 pages: flesch: 49.0 cache: ./cache/cord-014462-11ggaqf1.txt txt: ./txt/cord-014462-11ggaqf1.txt summary: Molecular diagnosis based on reverse transcription (RT)-PCR s.a. 
one step or nested PCR, nucleic acid sequence based amplification (NASBA), or real time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that NS1 antigen based ELISA test can be an useful tool to detect the dengue virus infection in patients during the early acute phase of disease since appearance of IgM antibodies usually occur after fifth day of the infection. The studies showed high level of expression in case of constructed vector as compared to infected virus for the specific protein. abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639731/ doi: 10.1007/s13337-011-0027-2 id: cord-014674-ey29970v author: nan title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 date: 2003 words: 2522.0 sentences: 181.0 pages: flesch: 62.0 cache: ./cache/cord-014674-ey29970v.txt txt: ./txt/cord-014674-ey29970v.txt summary: title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report.We find that aside from problematic details of the experimental design and some erratic presentations of the data the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of 
''criollo'' maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig.2 as regions flanking the CMV p-35S sequence.We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites. abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7079883/ doi: 10.1007/s00103-003-0614-5 id: cord-023208-w99gc5nx author: nan title: Poster Presentation Abstracts date: 2006-09-01 words: 70854.0 sentences: 3492.0 pages: flesch: 43.0 cache: ./cache/cord-023208-w99gc5nx.txt txt: ./txt/cord-023208-w99gc5nx.txt summary: In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitinium is a well known mechanism in protein degredation of Eukaryotic cells ,in which many obsolte and corrupted three dimentional structure protein ,become marked by covalent attachment of ubuquitin through a multi-step enzymatic pathway.Ubiquitin is a small ,8.5 kDa peptide of 76 amino acid residues that targets such substrtes for proteolysis in proteasome .Recnt studies showed that an extra cellular ubiquitination process also taking place in the epididymes of humans and other animals marks protein on the surface of the defective sperm .it appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. 
This head-to-tailcyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity.As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity. abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167816/ doi: 10.1002/psc.797 id: cord-023209-un2ysc2v author: nan title: Poster Presentations date: 2008-10-07 words: 111878.0 sentences: 5398.0 pages: flesch: 45.0 cache: ./cache/cord-023209-un2ysc2v.txt txt: ./txt/cord-023209-un2ysc2v.txt summary: Site-specifi c PEGylation of human IgG1-Fab using a rationally designed trypsin variant In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifi cations of proteins with artifi cial functionalities under native state conditions. Recently, our group report a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes In the present study, peptide epitops of VP1 protein both 135-161(P1) amino acid residues (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and triptophan (Trp) containing on the N terminus 135-161 amino acid residues (Trp-135-161) (P2) were synthesized by using the microwave assisted solid-phase methods. Using as a template a peptide, already identifi ed, with agonist activity against PTPRJ(H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifi cations (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfi de bridge) and the preliminary biological results of this set of compounds. 
abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/ doi: 10.1002/psc.1090 id: cord-023647-dlqs8ay9 author: nan title: Sequences and topology date: 2003-03-21 words: 4505.0 sentences: 747.0 pages: flesch: 69.0 cache: ./cache/cord-023647-dlqs8ay9.txt txt: ./txt/cord-023647-dlqs8ay9.txt summary: Nucleotide Sequence Analysis of the L G~ne of Vesicular Stomafltia Virus (New Jersey Serotype) --Identification of Conserved Domai~L~ in L Proteins of Nonsegmented Negative-Strand RNA Viruses DERSE I~ Equine Infectious Anemia Virus tat--Insights into the Structure, Function, and Evolution of Lentivtrus tran.~Activator Proteins Ho~tu~ ~ s71 is a Ehylngcueticellly Distinct Human Endogenous Reteovtgal 1Rlement with Structural mad Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Fercedoxins from Rhodobacter-Capsulstus -Complete Amino Acid Sequences and Molecular Evolution Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Amon~ the ~x-subunits and Molecular ~ The Predicted Amino Acid Sequence of ct-lnternexin is that of a novel Neuronal lntegmedla~ ~ent Protein Inttaspecific Evolution of a Gene Family Coding for Urinary Proteins Attalysi~ of CDNA for Human ~ AJudgyrin I~dicltes a Repeated Structure with Homology to Tissue-Differentiation a~td Cell-Cycle Control Protein abstract: nan url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173161/ doi: 10.1016/0959-440x(91)90051-t id: cord-300796-rmjv56ia author: nan title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation date: 1990-09-01 words: 8031.0 sentences: 405.0 pages: flesch: 57.0 cache: ./cache/cord-300796-rmjv56ia.txt txt: ./txt/cord-300796-rmjv56ia.txt summary: In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated 
early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40 residues long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980 ; references on dhfr sequence in legend to Fig. 1) , Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain. abstract: So far it has been demonstrated that the signal sequence of proteins which are made at the ER functions both at the level of protein targeting to the ER and in initiation of chain translocation across the ER membrane. However, its possible role in completing the process of chain transfer (see Singer, S. J., P. A. Maher, and M. P. Yaffe. Proc. Natl. Acad. Sci. USA. 1987. 84:1015-1019) has remained elusive. In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. 
url: https://www.ncbi.nlm.nih.gov/pubmed/2391367/ doi: nan id: cord-256608-ajzk86rq author: van Weezep, Erik title: PCR diagnostics: In silico validation by an automated tool using freely available software programs date: 2019-05-13 words: 4950.0 sentences: 258.0 pages: flesch: 54.0 cache: ./cache/cord-256608-ajzk86rq.txt txt: ./txt/cord-256608-ajzk86rq.txt summary: An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017). The in silico specificity is expressed as the percentage of specific hits of taxonomy-classified sequences with a maximum of one mismatch per primer or probe, as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001). abstract: PCR diagnostics are often the first line of laboratory diagnostics and are regularly designed to either differentiate between or detect all pathogen variants of a family, genus or species. The ideal PCR test detects all variants of the target pathogen, including newly discovered and emerging variants, while closely related pathogens and their variants should not be detected. This is challenging as pathogens show a high degree of genetic variation due to genetic drift, adaptation and evolution. Therefore, frequent re-evaluation of PCR diagnostics is needed to monitor their usefulness. Validation of PCR diagnostics recognizes three stages: in silico, in vitro and in vivo validation. In vitro and in vivo testing are usually costly and labour intensive, and imply a risk of handling dangerous pathogens.
In silico validation reduces this burden. In silico validation checks primers and probes by comparing their sequences with available nucleotide sequences. In recent years the number of available sequences has dramatically increased through high-throughput and deep sequencing projects. This makes in silico validation more informative, but also more computing intensive. To facilitate validation of PCR tests, a software tool named PCRv was developed. PCRv consists of a user-friendly graphical user interface and coordinates the use of the software programs ClustalW and SSEARCH in order to perform in silico validation of PCR tests of different formats. Use of internal control sequences makes the analysis compliant with laboratory quality control systems. Finally, PCRv generates a validation report that includes an overview as well as a list of detailed results. In-house developed, published and OIE-recommended PCR tests were easily (re-)evaluated by use of PCRv. To demonstrate the power of PCRv, in silico validation of several PCR tests is shown and discussed. url: https://doi.org/10.1016/j.jviromet.2019.05.002 doi: 10.1016/j.jviromet.2019.05.002 ==== make-pages.sh questions [ERIC WAS HERE] ==== make-pages.sh search /data-disk/reader-compute/reader-cord/bin/make-pages.sh: line 77: /data-disk/reader-compute/reader-cord/tmp/search.htm: No such file or directory Traceback (most recent call last): File "/data-disk/reader-compute/reader-cord/bin/tsv2htm-search.py", line 51, in <module> with open( TEMPLATE, 'r' ) as handle : htm = handle.read() FileNotFoundError: [Errno 2] No such file or directory: '/data-disk/reader-compute/reader-cord/tmp/search.htm' ==== make-pages.sh topic modeling corpus Zipping study carrel