Carrel name: keyword-sequence-cord
Creating study carrel named keyword-sequence-cord
Initializing database
         file: cache/cord-000257-ampip7od.json
          key: cord-000257-ampip7od
      authors: Bagowski, Christoph P; Bruins, Wouter; te Velthuis, Aartjan J.W
        title: The Nature of Protein Domain Evolution: Shaping the Interaction Network
         date: 2010-08-17
      journal: Curr Genomics
          DOI: 10.2174/138920210791616725
          sha: 
       doc_id: 257
     cord_uid: ampip7od

         file: cache/cord-016293-pyb00pt5.json
          key: cord-016293-pyb00pt5
      authors: Newell-McGloughlin, Martina; Re, Edward
        title: The flowering of the age of Biotechnology 1990–2000
         date: 2006
      journal: The Evolution of Biotechnology
          DOI: 10.1007/1-4020-5149-2_4
          sha: 
       doc_id: 16293
     cord_uid: pyb00pt5

         file: cache/cord-016798-tv2ntug6.json
          key: cord-016798-tv2ntug6
      authors: Gautam, Ablesh; Tiwari, Ashish; Malik, Yashpal Singh
        title: Bioinformatics Applications in Advancing Animal Virus Research
         date: 2019-06-06
      journal: Recent Advances in Animal Virology
          DOI: 10.1007/978-981-13-9073-9_23
          sha: 
       doc_id: 16798
     cord_uid: tv2ntug6

         file: cache/cord-000473-jpow6iw1.json
          key: cord-000473-jpow6iw1
      authors: Astrovskaya, Irina; Tork, Bassam; Mangul, Serghei; Westbrooks, Kelly; Măndoiu, Ion; Balfe, Peter; Zelikovsky, Alex
        title: Inferring viral quasispecies spectra from 454 pyrosequencing reads
         date: 2011-07-28
      journal: BMC Bioinformatics
          DOI: 10.1186/1471-2105-12-s6-s1
          sha: 
       doc_id: 473
     cord_uid: jpow6iw1

         file: cache/cord-025610-7vouj8pp.json
          key: cord-025610-7vouj8pp
      authors: Latif, Seemab; Bashir, Sarmad; Agha, Mir Muntasar Ali; Latif, Rabia
        title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
         date: 2020-05-06
      journal: Artificial Intelligence Applications and Innovations
          DOI: 10.1007/978-3-030-49186-4_4
          sha: 
       doc_id: 25610
     cord_uid: 7vouj8pp

         file: cache/cord-004862-yv76yvy5.json
          key: cord-004862-yv76yvy5
      authors: Demers, G. William; Matunis, Michael J.; Hardison, Ross C.
        title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin
         date: 1989
      journal: J Mol Evol
          DOI: 10.1007/bf02106177
          sha: 
       doc_id: 4862
     cord_uid: yv76yvy5

         file: cache/cord-025948-6dsx7pey.json
          key: cord-025948-6dsx7pey
      authors: Maitra, Arindam; Sarkar, Mamta Chawla; Raheja, Harsha; Biswas, Nidhan K; Chakraborti, Sohini; Singh, Animesh Kumar; Ghosh, Shekhar; Sarkar, Sumanta; Patra, Subrata; Mondal, Rajiv Kumar; Ghosh, Trinath; Chatterjee, Ananya; Banu, Hasina; Majumdar, Agniva; Chinnaswamy, Sreedhar; Srinivasan, Narayanaswamy; Dutta, Shanta; Das, Saumitra
        title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
         date: 2020-06-04
      journal: J Biosci
          DOI: 10.1007/s12038-020-00046-1
          sha: 
       doc_id: 25948
     cord_uid: 6dsx7pey

         file: cache/cord-014674-ey29970v.json
          key: cord-014674-ey29970v
      authors: nan
        title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002
         date: 2003
      journal: Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz
          DOI: 10.1007/s00103-003-0614-5
          sha: 
       doc_id: 14674
     cord_uid: ey29970v

         file: cache/cord-018459-isbc1r2o.json
          key: cord-018459-isbc1r2o
      authors: Munjal, Geetika; Hanmandlu, Madasu; Srivastava, Sangeet
        title: Phylogenetics Algorithms and Applications
         date: 2018-12-10
      journal: Ambient Communications and Computer Systems
          DOI: 10.1007/978-981-13-5934-7_17
          sha: 
       doc_id: 18459
     cord_uid: isbc1r2o

         file: cache/cord-015850-ef6svn8f.json
          key: cord-015850-ef6svn8f
      authors: Saitou, Naruya
        title: Eukaryote Genomes
         date: 2013-08-22
      journal: Introduction to Evolutionary Genomics
          DOI: 10.1007/978-1-4471-5304-7_8
          sha: 
       doc_id: 15850
     cord_uid: ef6svn8f

         file: cache/cord-012975-u87ol3fs.json
          key: cord-012975-u87ol3fs
      authors: Ogiwara, Atsushi; Uchiyama, Ikuo; Seto, Yasuhiko; Kanehisa, Minoru
        title: Construction of a dictionary of sequence motifs that characterize groups of related proteins
         date: 1992-09-17
      journal: Protein Eng
          DOI: 10.1093/protein/5.6.479
          sha: 
       doc_id: 12975
     cord_uid: u87ol3fs

         file: cache/cord-033010-o5kiadfm.json
          key: cord-033010-o5kiadfm
      authors: Durojaye, Olanrewaju Ayodeji; Mushiana, Talifhani; Uzoeto, Henrietta Onyinye; Cosmas, Samuel; Udowo, Victor Malachy; Osotuyi, Abayomi Gaius; Ibiang, Glory Omini; Gonlepa, Miapeh Kous
        title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study
         date: 2020-10-02
      journal: Egypt J Med Hum Genet
          DOI: 10.1186/s43042-020-00081-5
          sha: 
       doc_id: 33010
     cord_uid: o5kiadfm

         file: cache/cord-256608-ajzk86rq.json
          key: cord-256608-ajzk86rq
      authors: van Weezep, Erik; Kooi, Engbert A.; van Rijn, Piet A.
        title: PCR diagnostics: In silico validation by an automated tool using freely available software programs
         date: 2019-05-13
      journal: J Virol Methods
          DOI: 10.1016/j.jviromet.2019.05.002
          sha: 
       doc_id: 256608
     cord_uid: ajzk86rq

         file: cache/cord-103029-nc5yf6x4.json
          key: cord-103029-nc5yf6x4
      authors: Wichmann, Stefan; Scherer, Siegfried; Ardern, Zachary
        title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank
         date: 2020-09-25
      journal: bioRxiv
          DOI: 10.1101/2020.09.25.312959
          sha: 
       doc_id: 103029
     cord_uid: nc5yf6x4

         file: cache/cord-001340-kqcx7lrq.json
          key: cord-001340-kqcx7lrq
      authors: Ladner, Jason T.; Beitzel, Brett; Chain, Patrick S. G.; Davenport, Matthew G.; Donaldson, Eric; Frieman, Matthew; Kugelman, Jeffrey; Kuhn, Jens H.; O’Rear, Jules; Sabeti, Pardis C.; Wentworth, David E.; Wiley, Michael R.; Yu, Guo-Yun; Sozhamannan, Shanmuga; Bradburne, Christopher; Palacios, Gustavo
        title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
         date: 2014-06-17
      journal: mBio
          DOI: 10.1128/mbio.01360-14
          sha: 
       doc_id: 1340
     cord_uid: kqcx7lrq

         file: cache/cord-002473-2kpxhzbe.json
          key: cord-002473-2kpxhzbe
      authors: Das, Jayanta Kumar; Pal Choudhury, Pabitra
        title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach
         date: 2017-03-31
      journal: PLoS One
          DOI: 10.1371/journal.pone.0175031
          sha: 
       doc_id: 2473
     cord_uid: 2kpxhzbe

         file: cache/cord-010260-8lnpujip.json
          key: cord-010260-8lnpujip
      authors: Anthonsen, Henrik W.; Baptista, António; Drabløs, Finn; Martel, Paulo; Petersen, Steffen B.
        title: The blind watchmaker and rational protein engineering
         date: 1994-08-31
      journal: J Biotechnol
          DOI: 10.1016/0168-1656(94)90152-x
          sha: 
       doc_id: 10260
     cord_uid: 8lnpujip

         file: cache/cord-010161-bcuec2fz.json
          key: cord-010161-bcuec2fz
      authors: Matson, David O.
        title: IV, 6. Calicivirus RNA recombination
         date: 2004-09-14
      journal: Perspect Med Virol
          DOI: 10.1016/s0168-7069(03)09032-3
          sha: 
       doc_id: 10161
     cord_uid: bcuec2fz

         file: cache/cord-017584-9rx4jlw8.json
          key: cord-017584-9rx4jlw8
      authors: Kim, Kwangsoo; Ryoo, Hong Seo
        title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data
         date: 2007
      journal: Advances in Artificial Intelligence
          DOI: 10.1007/978-3-540-72665-4_8
          sha: 
       doc_id: 17584
     cord_uid: 9rx4jlw8

         file: cache/cord-005060-n901y2d4.json
          key: cord-005060-n901y2d4
      authors: ZHANG, Feiyun; TORIYAMA, Shigemitsu; TAKAHASHI, Mami
        title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus
         date: 2001
      journal: J
          DOI: 10.1007/pl00012989
          sha: 
       doc_id: 5060
     cord_uid: n901y2d4

         file: cache/cord-011565-8ncgldaq.json
          key: cord-011565-8ncgldaq
      authors: Elworth, R A Leo; Wang, Qi; Kota, Pavan K; Barberan, C J; Coleman, Benjamin; Balaji, Advait; Gupta, Gaurav; Baraniuk, Richard G; Shrivastava, Anshumali; Treangen, Todd J
        title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
         date: 2020-06-04
      journal: Nucleic Acids Res
          DOI: 10.1093/nar/gkaa265
          sha: 
       doc_id: 11565
     cord_uid: 8ncgldaq

         file: cache/cord-001537-i34vmfpp.json
          key: cord-001537-i34vmfpp
      authors: Lima, Francisco Esmaile de Sales; Cibulski, Samuel Paulo; dos Santos, Helton Fernandes; Teixeira, Thais Fumaco; Varela, Ana Paula Muterle; Roehe, Paulo Michel; Delwart, Eric; Franco, Ana Cláudia
        title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil
         date: 2015-02-17
      journal: PLoS One
          DOI: 10.1371/journal.pone.0118070
          sha: 
       doc_id: 1537
     cord_uid: i34vmfpp

         file: cache/cord-256278-jvfjf7aw.json
          key: cord-256278-jvfjf7aw
      authors: Feng, Jie; Hu, Yong; Wan, Ping; Zhang, Aibing; Zhao, Weizhong
        title: New method for comparing DNA primary sequences based on a discrimination measure
         date: 2010-10-21
      journal: Journal of Theoretical Biology
          DOI: 10.1016/j.jtbi.2010.07.040
          sha: 
       doc_id: 256278
     cord_uid: jvfjf7aw

         file: cache/cord-000642-mkwpuav6.json
          key: cord-000642-mkwpuav6
      authors: Moreira, Rebeca; Balseiro, Pablo; Planas, Josep V.; Fuste, Berta; Beltran, Sergi; Novoa, Beatriz; Figueras, Antonio
        title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing
         date: 2012-04-19
      journal: PLoS One
          DOI: 10.1371/journal.pone.0035009
          sha: 
       doc_id: 642
     cord_uid: mkwpuav6

         file: cache/cord-255194-4i9fc0r7.json
          key: cord-255194-4i9fc0r7
      authors: Djikeng, Appolinaire; Halpin, Rebecca; Kuzmickas, Ryan; DePasse, Jay; Feldblyum, Jeremy; Sengamalay, Naomi; Afonso, Claudio; Zhang, Xinsheng; Anderson, Norman G; Ghedin, Elodie; Spiro, David J
        title: Viral genome sequencing by random priming methods
         date: 2008-01-07
      journal: BMC Genomics
          DOI: 10.1186/1471-2164-9-5
          sha: 
       doc_id: 255194
     cord_uid: 4i9fc0r7

         file: cache/cord-016594-lj0us1dq.json
          key: cord-016594-lj0us1dq
      authors: Flower, Darren R.; Davies, Matthew N.; Doytchinova, Irini A.
        title: Identification of Candidate Vaccine Antigens In Silico
         date: 2012-09-28
      journal: Immunomic Discovery of Adjuvants and Candidate Subunit Vaccines
          DOI: 10.1007/978-1-4614-5070-2_3
          sha: 
       doc_id: 16594
     cord_uid: lj0us1dq

         file: cache/cord-023647-dlqs8ay9.json
          key: cord-023647-dlqs8ay9
      authors: nan
        title: Sequences and topology
         date: 2003-03-21
      journal: Curr Opin Struct Biol
          DOI: 10.1016/0959-440x(91)90051-t
          sha: 
       doc_id: 23647
     cord_uid: dlqs8ay9

         file: cache/cord-022348-w7z97wir.json
          key: cord-022348-w7z97wir
      authors: Sola, Monica; Wain-Hobson, Simon
        title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
         date: 2007-09-02
      journal: Origin and Evolution of Viruses
          DOI: 10.1016/b978-012220360-2/50007-6
          sha: 
       doc_id: 22348
     cord_uid: w7z97wir

         file: cache/cord-264296-0x90yubt.json
          key: cord-264296-0x90yubt
      authors: Sawmya, Shashata; Saha, Arpita; Tasnim, Sadia; Anjum, Naser; Toufikuzzaman, Md.; Rafid, Ali Haisam Muhammad; Rahman, Mohammad Saifur; Rahman, M. Sohel
        title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond
         date: 2020-06-03
      journal: bioRxiv
          DOI: 10.1101/2020.06.03.131987
          sha: 
       doc_id: 264296
     cord_uid: 0x90yubt

         file: cache/cord-035033-osjy88rc.json
          key: cord-035033-osjy88rc
      authors: Aydin, Berkay; Boubrahimi, Soukaina Filali; Kucuk, Ahmet; Nezamdoust, Bita; Angryk, Rafal A.
        title: Spatiotemporal event sequence discovery without thresholds
         date: 2020-11-09
      journal: Geoinformatica
          DOI: 10.1007/s10707-020-00427-6
          sha: 
       doc_id: 35033
     cord_uid: osjy88rc

         file: cache/cord-203232-1nnqx1g9.json
          key: cord-203232-1nnqx1g9
      authors: Canturk, Semih; Singh, Aman; St-Amant, Patrick; Behrmann, Jason
        title: Machine-Learning Driven Drug Repurposing for COVID-19
         date: 2020-06-25
      journal: nan
          DOI: nan
          sha: 
       doc_id: 203232
     cord_uid: 1nnqx1g9

         file: cache/cord-264135-s2u76pvk.json
          key: cord-264135-s2u76pvk
      authors: Patel, Amrutlal K.; Pandit, Ramesh J.; Thakkar, Jalpa R.; Hinsu, Ankit T.; Pandey, Vinod C.; Pal, Joy K.; Prajapati, Kantilal S.; Jakhesara, Subhash J.; Joshi, Chaitanya G.
        title: Complete genome sequence analysis of chicken astrovirus isolate from India
         date: 2016-12-23
      journal: Vet Res Commun
          DOI: 10.1007/s11259-016-9673-6
          sha: 
       doc_id: 264135
     cord_uid: s2u76pvk

         file: cache/cord-266288-buc4dd5y.json
          key: cord-266288-buc4dd5y
      authors: Dong, Rui; He, Lily; He, Rong Lucy; Yau, Stephen S.-T.
        title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance
         date: 2019-04-09
      journal: Front Genet
          DOI: 10.3389/fgene.2019.00234
          sha: 
       doc_id: 266288
     cord_uid: buc4dd5y

         file: cache/cord-001786-ybd8hi8y.json
          key: cord-001786-ybd8hi8y
      authors: Dutilh, Bas E
        title: Metagenomic ventures into outer sequence space
         date: 2014-12-15
      journal: Bacteriophage
          DOI: 10.4161/21597081.2014.979664
          sha: 
       doc_id: 1786
     cord_uid: ybd8hi8y

         file: cache/cord-018133-2otxft31.json
          key: cord-018133-2otxft31
      authors: Altman, Russ B.; Mooney, Sean D.
        title: Bioinformatics
         date: 2006
      journal: Biomedical Informatics
          DOI: 10.1007/0-387-36278-9_22
          sha: 
       doc_id: 18133
     cord_uid: 2otxft31

         file: cache/cord-266960-kyx6xhvj.json
          key: cord-266960-kyx6xhvj
      authors: Temple, Mark D.
        title: Real-time audio and visual display of the Coronavirus genome
         date: 2020-10-02
      journal: BMC Bioinformatics
          DOI: 10.1186/s12859-020-03760-7
          sha: 
       doc_id: 266960
     cord_uid: kyx6xhvj

         file: cache/cord-003316-r5te5xob.json
          key: cord-003316-r5te5xob
      authors: Balloux, Francois; Brønstad Brynildsrud, Ola; van Dorp, Lucy; Shaw, Liam P.; Chen, Hongbin; Harris, Kathryn A.; Wang, Hui; Eldholm, Vegard
        title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
         date: 2018-12-17
      journal: Trends Microbiol
          DOI: 10.1016/j.tim.2018.08.004
          sha: 
       doc_id: 3316
     cord_uid: r5te5xob

         file: cache/cord-300796-rmjv56ia.json
          key: cord-300796-rmjv56ia
      authors: nan
        title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation
         date: 1990-09-01
      journal: J Cell Biol
          DOI: nan
          sha: 
       doc_id: 300796
     cord_uid: rmjv56ia

         file: cache/cord-017932-vmtjc8ct.json
          key: cord-017932-vmtjc8ct
      authors: Georgiev, Vassil St.
        title: Genomic and Postgenomic Research
         date: 2009
      journal: National Institute of Allergy and Infectious Diseases, NIH
          DOI: 10.1007/978-1-60327-297-1_25
          sha: 
       doc_id: 17932
     cord_uid: vmtjc8ct

         file: cache/cord-265857-fs6dj3dp.json
          key: cord-265857-fs6dj3dp
      authors: Liu, Yu-Tsueng
        title: Infectious Disease Genomics
         date: 2010-12-24
      journal: Genetics and Evolution of Infectious Disease
          DOI: 10.1016/b978-0-12-384890-1.00010-8
          sha: 
       doc_id: 265857
     cord_uid: fs6dj3dp

         file: cache/cord-010273-0c56x9f5.json
          key: cord-010273-0c56x9f5
      authors: Simmonds, Peter
        title: Virology of hepatitis C virus
         date: 2001-10-10
      journal: Clin Ther
          DOI: 10.1016/s0149-2918(96)80193-7
          sha: 
       doc_id: 10273
     cord_uid: 0c56x9f5

         file: cache/cord-010499-yefxrj30.json
          key: cord-010499-yefxrj30
      authors: Yelverton, Elizabeth; Lindsley, Dale; Yamauchi, Phil; Gallant, Jonathan A.
        title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli
         date: 2006-10-27
      journal: Mol Microbiol
          DOI: 10.1111/j.1365-2958.1994.tb00310.x
          sha: 
       doc_id: 10499
     cord_uid: yefxrj30

         file: cache/cord-263987-ff6kor0c.json
          key: cord-263987-ff6kor0c
      authors: Holmes, Ian H.
        title: Solving the master equation for Indels
         date: 2017-05-12
      journal: BMC Bioinformatics
          DOI: 10.1186/s12859-017-1665-1
          sha: 
       doc_id: 263987
     cord_uid: ff6kor0c

         file: cache/cord-022494-d66rz6dc.json
          key: cord-022494-d66rz6dc
      authors: Webb, B.; Eswar, N.; Fan, H.; Khuri, N.; Pieper, U.; Dong, G.Q.; Sali, A.
        title: Comparative Modeling of Drug Target Proteins
         date: 2014-10-01
      journal: Reference Module in Chemistry, Molecular Sciences and Chemical Engineering
          DOI: 10.1016/b978-0-12-409547-2.11133-3
          sha: 
       doc_id: 22494
     cord_uid: d66rz6dc

         file: cache/cord-193910-7p3f3znj.json
          key: cord-193910-7p3f3znj
      authors: Zhang, Xiangxie; Beinke, Ben; Kindhi, Berlian Al; Wiering, Marco
        title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
         date: 2020-11-01
      journal: nan
          DOI: nan
          sha: 
       doc_id: 193910
     cord_uid: 7p3f3znj

         file: cache/cord-253436-dz84icdc.json
          key: cord-253436-dz84icdc
      authors: Wille, Michelle; Muradrasoli, Shaman; Nilsson, Anna; Järhult, Josef D.
        title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl
         date: 2016-03-03
      journal: PLoS One
          DOI: 10.1371/journal.pone.0150198
          sha: 
       doc_id: 253436
     cord_uid: dz84icdc

         file: cache/cord-255371-o9oxchq6.json
          key: cord-255371-o9oxchq6
      authors: Nguyen, Thanh Thi; Pathirana, Pubudu N.; Nguyen, Thin; Nguyen, Henry; Bhatti, Asim; Nguyen, Dinh C.; Nguyen, Dung Tien; Nguyen, Ngoc Duy; Creighton, Douglas; Abdelrazek, Mohamed
        title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
         date: 2020-07-10
      journal: bioRxiv
          DOI: 10.1101/2020.07.10.171769
          sha: 
       doc_id: 255371
     cord_uid: o9oxchq6

         file: cache/cord-017354-cndb031c.json
          key: cord-017354-cndb031c
      authors: Janies, D.; Pol, D.
        title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases
         date: 2008
      journal: Tutorials in Mathematical Biosciences IV
          DOI: 10.1007/978-3-540-74331-6_2
          sha: 
       doc_id: 17354
     cord_uid: cndb031c

         file: cache/cord-014461-2ubh9u8r.json
          key: cord-014461-2ubh9u8r
      authors: Nelson, Oranmiyan W.; Garrity, George M.
        title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
         date: 2012-10-10
      journal: Stand Genomic Sci
          DOI: 10.4056/sigs.3416907
          sha: 
       doc_id: 14461
     cord_uid: 2ubh9u8r

         file: cache/cord-014462-11ggaqf1.json
          key: cord-014462-11ggaqf1
      authors: nan
        title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh
         date: 2011-04-21
      journal: Indian J Virol
          DOI: 10.1007/s13337-011-0027-2
          sha: 
       doc_id: 14462
     cord_uid: 11ggaqf1

         file: cache/cord-268549-2lg8i9r1.json
          key: cord-268549-2lg8i9r1
      authors: Dai, Qi; Guo, Xiaodong; Li, Lihua
        title: Sequence comparison via polar coordinates representation and curve tree
         date: 2012-01-07
      journal: Journal of Theoretical Biology
          DOI: 10.1016/j.jtbi.2011.09.030
          sha: 
       doc_id: 268549
     cord_uid: 2lg8i9r1

         file: cache/cord-001974-wjf3c7a7.json
          key: cord-001974-wjf3c7a7
      authors: Friis-Nielsen, Jens; Kjartansdóttir, Kristín Rós; Mollerup, Sarah; Asplund, Maria; Mourier, Tobias; Jensen, Randi Holm; Hansen, Thomas Arn; Rey-Iglesia, Alba; Richter, Stine Raith; Nielsen, Ida Broman; Alquezar-Planas, David E.; Olsen, Pernille V. S.; Vinner, Lasse; Fridholm, Helena; Nielsen, Lars Peter; Willerslev, Eske; Sicheritz-Pontén, Thomas; Lund, Ole; Hansen, Anders Johannes; Izarzugaza, Jose M. G.; Brunak, Søren
        title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
         date: 2016-02-19
      journal: Viruses
          DOI: 10.3390/v8020053
          sha: 
       doc_id: 1974
     cord_uid: wjf3c7a7

         file: cache/cord-275258-azpg5yrh.json
          key: cord-275258-azpg5yrh
      authors: Mead, Dylan J.T.; Lunagomez, Simón; Gatherer, Derek
        title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling
         date: 2019-07-26
      journal: J Mol Graph Model
          DOI: 10.1016/j.jmgm.2019.07.014
          sha: 
       doc_id: 275258
     cord_uid: azpg5yrh

         file: cache/cord-321386-u1imic5l.json
          key: cord-321386-u1imic5l
      authors: Li, Chun; Zhao, Jialing; Wang, Changzhong; Yao, Yuhua
        title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
         date: 2018-02-17
      journal: Comb Chem High Throughput Screen
          DOI: 10.2174/1386207321666180130100838
          sha: 
       doc_id: 321386
     cord_uid: u1imic5l

         file: cache/cord-023208-w99gc5nx.json
          key: cord-023208-w99gc5nx
      authors: nan
        title: Poster Presentation Abstracts
         date: 2006-09-01
      journal: J Pept Sci
          DOI: 10.1002/psc.797
          sha: 
       doc_id: 23208
     cord_uid: w99gc5nx

         file: cache/cord-306725-0vam15pt.json
          key: cord-306725-0vam15pt
      authors: Li, Hao; Zhang, Bin; Yue, Hua; Tang, Cheng
        title: First detection and genomic characteristics of bovine torovirus in dairy calves in China
         date: 2020-05-09
      journal: Arch Virol
          DOI: 10.1007/s00705-020-04657-9
          sha: 
       doc_id: 306725
     cord_uid: 0vam15pt

         file: cache/cord-027316-echxuw74.json
          key: cord-027316-echxuw74
      authors: Modarresi, Kourosh
        title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model
         date: 2020-05-22
      journal: Computational Science - ICCS 2020
          DOI: 10.1007/978-3-030-50420-5_20
          sha: 
       doc_id: 27316
     cord_uid: echxuw74

         file: cache/cord-213136-euv6pqh5.json
          key: cord-213136-euv6pqh5
      authors: Singh, Kulveer; Rabin, Yitzhak
        title: Sequence Effects on Internal Structure of Droplets of Associative Polymers
         date: 2020-05-17
      journal: nan
          DOI: nan
          sha: 
       doc_id: 213136
     cord_uid: euv6pqh5

         file: cache/cord-103297-4stnx8dw.json
          key: cord-103297-4stnx8dw
      authors: Widrich, Michael; Schäfl, Bernhard; Pavlović, Milena; Ramsauer, Hubert; Gruber, Lukas; Holzleitner, Markus; Brandstetter, Johannes; Sandve, Geir Kjetil; Greiff, Victor; Hochreiter, Sepp; Klambauer, Günter
        title: Modern Hopfield Networks and Attention for Immune Repertoire Classification
         date: 2020-08-17
      journal: bioRxiv
          DOI: 10.1101/2020.04.12.038158
          sha: 
       doc_id: 103297
     cord_uid: 4stnx8dw

          key: cord-193356-hqbstgg7
      authors: Widrich, Michael; Schafl, Bernhard; Ramsauer, Hubert; Pavlovi'c, Milena; Gruber, Lukas; Holzleitner, Markus; Brandstetter, Johannes; Sandve, Geir Kjetil; Greiff, Victor; Hochreiter, Sepp; Klambauer, Gunter
        title: Modern Hopfield Networks and Attention for Immune Repertoire Classification
         date: 2020-07-16
      journal: nan
          DOI: nan
          sha: 
       doc_id: 193356
     cord_uid: hqbstgg7

         file: cache/cord-252347-vnn4135b.json
          key: cord-252347-vnn4135b
      authors: Lee, Wai-Ming; Kiesner, Christin; Pappas, Tressa; Lee, Iris; Grindle, Kris; Jartti, Tuomas; Jakiela, Bogdan; Lemanske, Robert F.; Shult, Peter A.; Gern, James E.
        title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants
         date: 2007-10-03
      journal: PLoS One
          DOI: 10.1371/journal.pone.0000966
          sha: 
       doc_id: 252347
     cord_uid: vnn4135b

         file: cache/cord-031957-df4luh5v.json
          key: cord-031957-df4luh5v
      authors: dos Santos-Silva, Carlos André; Zupin, Luisa; Oliveira-Lima, Marx; Vilela, Lívia Maria Batista; Bezerra-Neto, João Pacifico; Ferreira-Neto, José Ribamar; Ferreira, José Diogo Cavalcanti; de Oliveira-Silva, Roberta Lane; Pires, Carolline de Jesús; Aburjaile, Flavia Figueira; de Oliveira, Marianne Firmino; Kido, Ederson Akio; Crovella, Sergio; Benko-Iseppon, Ana Maria
        title: Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era
         date: 2020-09-02
      journal: Bioinform Biol Insights
          DOI: 10.1177/1177932220952739
          sha: 
       doc_id: 31957
     cord_uid: df4luh5v

         file: cache/cord-264746-gfn312aa.json
          key: cord-264746-gfn312aa
      authors: Muse, Spencer
        title: GENOMICS AND BIOINFORMATICS
         date: 2012-03-29
      journal: Introduction to Biomedical Engineering
          DOI: 10.1016/b978-0-12-238662-6.50015-x
          sha: 
       doc_id: 264746
     cord_uid: gfn312aa

         file: cache/cord-267500-x3u9i1vq.json
          key: cord-267500-x3u9i1vq
      authors: Rose, Rebecca; Constantinides, Bede; Tapinos, Avraam; Robertson, David L; Prosperi, Mattia
        title: Challenges in the analysis of viral metagenomes
         date: 2016-08-03
      journal: Virus Evol
          DOI: 10.1093/ve/vew022
          sha: 
       doc_id: 267500
     cord_uid: x3u9i1vq

         file: cache/cord-311240-o0zyt2vb.json
          key: cord-311240-o0zyt2vb
      authors: Motayo, Babatunde Olarenwaju; Oluwasemowo, Olukunle Oluwapamilerin; Akinduti, Paul Akiniyi; Olusola, Babatunde Adebiyi; Aerege, Olumide T; Faneye, Adedayo Omotayo
        title: Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences
         date: 2020-07-27
      journal: bioRxiv
          DOI: 10.1101/2020.07.27.222901
          sha: 
       doc_id: 311240
     cord_uid: o0zyt2vb

         file: cache/cord-321715-bkfkmtld.json
          key: cord-321715-bkfkmtld
      authors: Redelings, Benjamin D; Suchard, Marc A
        title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens
         date: 2007-03-14
      journal: BMC Evol Biol
          DOI: 10.1186/1471-2148-7-40
          sha: 
       doc_id: 321715
     cord_uid: bkfkmtld

         file: cache/cord-311839-61djk4bs.json
          key: cord-311839-61djk4bs
      authors: Wei, Dan; Jiang, Qingshan; Wei, Yanjie; Wang, Shengrui
        title: A novel hierarchical clustering algorithm for gene sequences
         date: 2012-07-23
      journal: BMC Bioinformatics
          DOI: 10.1186/1471-2105-13-174
          sha: 
       doc_id: 311839
     cord_uid: 61djk4bs

         file: cache/cord-018963-2lia97db.json
          key: cord-018963-2lia97db
      authors: Xu, Ying; Liu, Zhijie; Cai, Liming; Xu, Dong
        title: Protein Structure Prediction by Protein Threading
         date: 2010-04-29
      journal: Computational Methods for Protein Structure Prediction and Modeling
          DOI: 10.1007/978-0-387-68825-1_1
          sha: 
       doc_id: 18963
     cord_uid: 2lia97db

         file: cache/cord-321762-7kiahjyy.json
          key: cord-321762-7kiahjyy
      authors: Nandy, Ashesh
        title: Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences
         date: 2015-12-31
      journal: Advances in Mathematical Chemistry and Applications
          DOI: 10.1016/b978-1-68108-053-6.50005-3
          sha: 
       doc_id: 321762
     cord_uid: 7kiahjyy

         file: cache/cord-102766-n6mpdhyu.json
          key: cord-102766-n6mpdhyu
      authors: Alam, Md. Nafis Ul; Chowdhury, Umar Faruq
        title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses
         date: 2020-06-25
      journal: bioRxiv
          DOI: 10.1101/2020.06.25.170779
          sha: 
       doc_id: 102766
     cord_uid: n6mpdhyu

         file: cache/cord-254942-g51mjj2b.json
          key: cord-254942-g51mjj2b
      authors: Touati, Rabeb; Tajouri, Asma; Mesaoudi, Imen; Oueslati, Afef Elloumi; Lachiri, Zied; Kharrat, Maher
        title: New methodology for repetitive sequences identification in human X and Y chromosomes
         date: 2020-10-19
      journal: Biomed Signal Process Control
          DOI: 10.1016/j.bspc.2020.102207
          sha: 
       doc_id: 254942
     cord_uid: g51mjj2b

         file: cache/cord-321150-ev6acl7b.json
          key: cord-321150-ev6acl7b
      authors: Lam, Ha Minh; Ratmann, Oliver; Boni, Maciej F
        title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
         date: 2017-10-03
      journal: Mol Biol Evol
          DOI: 10.1093/molbev/msx263
          sha: 
       doc_id: 321150
     cord_uid: ev6acl7b

         file: cache/cord-302798-q0mbngqy.json
          key: cord-302798-q0mbngqy
      authors: Ge, Junwei; Gu, Shanshan; Cui, Xingyang; Zhao, Lili; Ma, Dexing; Shi, Yunjia; Wang, Yuanzhi; Lu, Taofeng; Chen, Hongyan
        title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China
         date: 2018-06-14
      journal: Arch Virol
          DOI: 10.1007/s00705-018-3908-5
          sha: 
       doc_id: 302798
     cord_uid: q0mbngqy

         file: cache/cord-266794-oyppubq5.json
          key: cord-266794-oyppubq5
      authors: Zhang, Dachuan; Zhang, Tong; Liu, Sheng; Sun, Dandan; Ding, Shaozhen; Cheng, Xingxiang; Cai, Pengli; Ren, Ailin; Han, Mengying; Liu, Dongliang; Jia, Cancan; Gong, Linlin; Zhang, Rui; Xing, Huadong; Tu, Weizhong; Chen, Junni; Hu, Qian-Nan
        title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
         date: 2020-09-01
      journal: Bioinformatics
          DOI: 10.1093/bioinformatics/btaa767
          sha: 
       doc_id: 266794
     cord_uid: oyppubq5

         file: cache/cord-300807-9u8idlon.json
          key: cord-300807-9u8idlon
      authors: Tong, Joo Chuan; Ranganathan, Shoba
        title: 7 Infectious disease informatics
         date: 2013-12-31
      journal: Computer-Aided Vaccine Design
          DOI: 10.1533/9781908818416.99
          sha: 
       doc_id: 300807
     cord_uid: 9u8idlon

         file: cache/cord-280881-5o38ihe0.json
          key: cord-280881-5o38ihe0
      authors: Wlodawer, Alexander; Durell, Stewart R; Li, Mi; Oyama, Hiroshi; Oda, Kohei; Dunn, Ben M
        title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases
         date: 2003-11-11
      journal: BMC Struct Biol
          DOI: 10.1186/1472-6807-3-8
          sha: 
       doc_id: 280881
     cord_uid: 5o38ihe0

         file: cache/cord-274056-9t3kneoo.json
          key: cord-274056-9t3kneoo
      authors: Abd Elwahaab, Marwa A.; Abo-Elkhier, Mervat M.; Abo el Maaty, Moheb I.
        title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
         date: 2019-05-08
      journal: Biomed Res Int
          DOI: 10.1155/2019/8702968
          sha: 
       doc_id: 274056
     cord_uid: 9t3kneoo

         file: cache/cord-325985-xfzhn1n1.json
          key: cord-325985-xfzhn1n1
      authors: Jabado, Omar J.; Liu, Yang; Conlan, Sean; Quan, P. Lan; Hegyi, Hédi; Lussier, Yves; Briese, Thomas; Palacios, Gustavo; Lipkin, W. I.
        title: Comprehensive viral oligonucleotide probe design using conserved protein regions
         date: 2007-12-13
      journal: Nucleic Acids Res
          DOI: 10.1093/nar/gkm1106
          sha: 
       doc_id: 325985
     cord_uid: xfzhn1n1

         file: cache/cord-279528-41atidai.json
          key: cord-279528-41atidai
      authors: Abo-Elkhier, Mervat M.; Abd Elwahaab, Marwa A.; Abo El Maaty, Moheb I.
        title: Measuring Similarity among Protein Sequences Using a New Descriptor
         date: 2019-11-22
      journal: Biomed Res Int
          DOI: 10.1155/2019/2796971
          sha: 
       doc_id: 279528
     cord_uid: 41atidai

         file: cache/cord-301827-a7hnuxy5.json
          key: cord-301827-a7hnuxy5
      authors: Uversky, Vladimir N
        title: A decade and a half of protein intrinsic disorder: Biology still waits for physics
         date: 2013-04-29
      journal: Protein Science
          DOI: 10.1002/pro.2261
          sha: 
       doc_id: 301827
     cord_uid: a7hnuxy5

         file: cache/cord-300149-djclli8n.json
          key: cord-300149-djclli8n
      authors: Ruan, Yijun; Wei, Chia Lin; Ling, Ai Ee; Vega, Vinsensius B; Thoreau, Herve; Se Thoe, Su Yun; Chia, Jer-Ming; Ng, Patrick; Chiu, Kuo Ping; Lim, Landri; Zhang, Tao; Chan, Kwai Peng; Lin Ean, Lynette Oon; Ng, Mah Lee; Leo, Sin Yee; Ng, Lisa FP; Ren, Ee Chee; Stanton, Lawrence W; Long, Philip M; Liu, Edison T
        title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
         date: 2003-05-24
      journal: Lancet
          DOI: 10.1016/s0140-6736(03)13414-9
          sha: 
       doc_id: 300149
     cord_uid: djclli8n

         file: cache/cord-268467-btfz6ye8.json
          key: cord-268467-btfz6ye8
      authors: Schreiber, Steven S.; Kamahora, Toshio; Lai, Michael M.C.
        title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E
         date: 1989-03-31
      journal: Virology
          DOI: 10.1016/0042-6822(89)90050-0
          sha: 
       doc_id: 268467
     cord_uid: btfz6ye8

         file: cache/cord-287658-c2lljdi7.json
          key: cord-287658-c2lljdi7
      authors: Lopez-Rincon, Alejandro; Tonda, Alberto; Mendoza-Maldonado, Lucero; Mulders, Daphne G.J.C.; Molenkamp, Richard; Perez-Romero, Carmina A.; Claassen, Eric; Garssen, Johan; Kraneveld, Aletta D.
        title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning
         date: 2020-09-10
      journal: bioRxiv
          DOI: 10.1101/2020.03.13.990242
          sha: 
       doc_id: 287658
     cord_uid: c2lljdi7

         file: cache/cord-304869-l6a68tqn.json
          key: cord-304869-l6a68tqn
      authors: Bielińska-Wąż, Dorota
        title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity
         date: 2011-08-28
      journal: J Math Chem
          DOI: 10.1007/s10910-011-9890-8
          sha: 
       doc_id: 304869
     cord_uid: l6a68tqn

         file: cache/cord-287634-64zqe4cz.json
          key: cord-287634-64zqe4cz
      authors: Al-Ssulami, Abdulrakeeb M.; Azmi, Aqil M.; Hussain, Muhammad
        title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents
         date: 2020-01-31
      journal: Genomics
          DOI: 10.1016/j.ygeno.2019.02.002
          sha: 
       doc_id: 287634
     cord_uid: 64zqe4cz

         file: cache/cord-324216-ce3wa889.json
          key: cord-324216-ce3wa889
      authors: Wang, Zheng; Malanoski, Anthony P; Lin, Baochuan; Kidd, Carolyn; Long, Nina C; Blaney, Kate M; Thach, Dzung C; Tibbetts, Clark; Stenger, David A
        title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses
         date: 2008-12-01
      journal: BMC Genomics
          DOI: 10.1186/1471-2164-9-577
          sha: 
       doc_id: 324216
     cord_uid: ce3wa889

         file: cache/cord-296691-cg463fbn.json
          key: cord-296691-cg463fbn
      authors: Wang, Ren; Xu, Sheng; Jiang, Yumei; Jiang, Jingwei; Li, Xiaodan; Liang, Lijian; He, Jia; Peng, Feng; Xia, Bing
        title: De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing
         date: 2013-04-09
      journal: PLoS One
          DOI: 10.1371/journal.pone.0060449
          sha: 
       doc_id: 296691
     cord_uid: cg463fbn

         file: cache/cord-302161-ytr7ds8i.json
          key: cord-302161-ytr7ds8i
      authors: Lutz, Mirjam; Steiner, Aline R.; Cattori, Valentino; Hofmann-Lehmann, Regina; Lutz, Hans; Kipar, Anja; Meli, Marina L.
        title: FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP
         date: 2020-07-24
      journal: Pathogens
          DOI: 10.3390/pathogens9080603
          sha: 
       doc_id: 302161
     cord_uid: ytr7ds8i

         file: cache/cord-291156-zxg3dsm3.json
          key: cord-291156-zxg3dsm3
      authors: Bernasconi, Anna; Canakoglu, Arif; Pinoli, Pietro; Ceri, Stefano
        title: Empowering Virus Sequences Research through Conceptual Modeling
         date: 2020-05-01
      journal: bioRxiv
          DOI: 10.1101/2020.04.29.067637
          sha: 
       doc_id: 291156
     cord_uid: zxg3dsm3

         file: cache/cord-304607-td0776wj.json
          key: cord-304607-td0776wj
      authors: Paszkiewicz, Konrad H.; Giezen, Mark van der
        title: Omics, Bioinformatics, and Infectious Disease Research
         date: 2010-12-24
      journal: Genetics and Evolution of Infectious Disease
          DOI: 10.1016/b978-0-12-384890-1.00018-2
          sha: 
       doc_id: 304607
     cord_uid: td0776wj

         file: cache/cord-310734-6v7oru2l.json
          key: cord-310734-6v7oru2l
      authors: Bolatti, Elisa M.; Zorec, Tomaž M.; Montani, María E.; Hošnjak, Lea; Chouhy, Diego; Viarengo, Gastón; Casal, Pablo E.; Barquez, Rubén M.; Poljak, Mario; Giri, Adriana A.
        title: A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses
         date: 2020-04-09
      journal: Viruses
          DOI: 10.3390/v12040422
          sha: 
       doc_id: 310734
     cord_uid: 6v7oru2l

         file: cache/cord-023209-un2ysc2v.json
          key: cord-023209-un2ysc2v
      authors: nan
        title: Poster Presentations
         date: 2008-10-07
      journal: J Pept Sci
          DOI: 10.1002/psc.1090
          sha: 
       doc_id: 23209
     cord_uid: un2ysc2v

         file: cache/cord-325043-vqjhiv7p.json
          key: cord-325043-vqjhiv7p
      authors: Gorbalenya, Alexander E.; Blinov, Vladimir M.; Donchenko, Alexei P.; Koonin, Eugene V.
        title: An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication
         date: 1989
      journal: J Mol Evol
          DOI: 10.1007/bf02102483
          sha: 
       doc_id: 325043
     cord_uid: vqjhiv7p

         file: cache/cord-004879-pgyzluwp.json
          key: cord-004879-pgyzluwp
      authors: nan
        title: Programmed cell death
         date: 1994
      journal: Experientia
          DOI: 10.1007/bf02033112
          sha: 
       doc_id: 4879
     cord_uid: pgyzluwp

         file: cache/cord-325750-x7jpsnxg.json
          key: cord-325750-x7jpsnxg
      authors: Mokili, John L; Rohwer, Forest; Dutilh, Bas E
        title: Metagenomics and future perspectives in virus discovery
         date: 2012-01-20
      journal: Curr Opin Virol
          DOI: 10.1016/j.coviro.2011.12.004
          sha: 
       doc_id: 325750
     cord_uid: x7jpsnxg

         file: cache/cord-324021-y1vr1db0.json
          key: cord-324021-y1vr1db0
      authors: Kozak, M.
        title: Determinants of translational fidelity and efficiency in vertebrate mRNAs
         date: 1994-12-31
      journal: Biochimie
          DOI: 10.1016/0300-9084(94)90182-1
          sha: 
       doc_id: 324021
     cord_uid: y1vr1db0

         file: cache/cord-001835-0s7ok4uw.json
          key: cord-001835-0s7ok4uw
      authors: nan
        title: Abstracts of the 29th Annual Symposium of The Protein Society
         date: 2015-10-01
      journal: Protein Science
          DOI: 10.1002/pro.2823
          sha: 
       doc_id: 1835
     cord_uid: 0s7ok4uw

         file: cache/cord-326225-crtpzad7.json
          key: cord-326225-crtpzad7
      authors: Neill, John D.; Bayles, Darrell O.; Ridpath, Julia F.
        title: Simultaneous rapid sequencing of multiple RNA virus genomes
         date: 2014-06-01
      journal: J Virol Methods
          DOI: 10.1016/j.jviromet.2014.02.016
          sha: 
       doc_id: 326225
     cord_uid: crtpzad7

         file: cache/cord-328644-odtue60a.json
          key: cord-328644-odtue60a
      authors: Comandatore, Francesco; Chiodi, Alice; Gabrieli, Paolo; Biffignandi, Gherard Batisti; Perini, Matteo; Ricagno, Stefano; Mascolo, Elia; Petazzoni, Greta; Ramazzotti, Matteo; Rimoldi, Sara Giordana; Gismondo, Maria Rita; Micheli, Valeria; Sassera, Davide; Gaiarsa, Stefano; Bandi, Claudio; Brilli, Matteo
        title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes
         date: 2020-05-28
      journal: bioRxiv
          DOI: 10.1101/2020.04.30.071027
          sha: 
       doc_id: 328644
     cord_uid: odtue60a

         file: cache/cord-334394-qgyzk7th.json
          key: cord-334394-qgyzk7th
      authors: Edgar, Robert C.; Taylor, Jeff; Altman, Tomer; Barbera, Pierre; Meleshko, Dmitry; Lin, Victor; Lohr, Dan; Novakovsky, Gherman; Al-Shayeb, Basem; Banfield, Jillian F.; Korobeynikov, Anton; Chikhi, Rayan; Babaian, Artem
        title: Petabase-scale sequence alignment catalyses viral discovery
         date: 2020-08-10
      journal: bioRxiv
          DOI: 10.1101/2020.08.07.241729
          sha: 
       doc_id: 334394
     cord_uid: qgyzk7th

         file: cache/cord-331698-rwow1ydx.json
          key: cord-331698-rwow1ydx
      authors: Latorre-Pérez, Adriel; Pascual, Javier; Porcar, Manuel; Vilanova, Cristina
        title: A lab in the field: applications of real-time, in situ metagenomic sequencing
         date: 2020-08-20
      journal: Biol Methods Protoc
          DOI: 10.1093/biomethods/bpaa016
          sha: 
       doc_id: 331698
     cord_uid: rwow1ydx

         file: cache/cord-330067-ujhgb3b0.json
          key: cord-330067-ujhgb3b0
      authors: Huang, Yi; Lau, Susanna K. P.; Woo, Patrick C. Y.; Yuen, Kwok-yung
        title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes
         date: 2007-10-02
      journal: Nucleic Acids Res
          DOI: 10.1093/nar/gkm754
          sha: 
       doc_id: 330067
     cord_uid: ujhgb3b0

         file: cache/cord-338207-60vrlrim.json
          key: cord-338207-60vrlrim
      authors: Lefkowitz, E.J.; Odom, M.R.; Upton, C.
        title: Virus Databases
         date: 2008-07-30
      journal: Encyclopedia of Virology
          DOI: 10.1016/b978-012374410-4.00719-6
          sha: 
       doc_id: 338207
     cord_uid: 60vrlrim

         file: cache/cord-339209-oe8onyr9.json
          key: cord-339209-oe8onyr9
      authors: Vasilakis, Nikos; Guzman, Hilda; Firth, Cadhla; Forrester, Naomi L; Widen, Steven G; Wood, Thomas G; Rossi, Shannan L; Ghedin, Elodie; Popov, Vsevolov; Blasdell, Kim R; Walker, Peter J; Tesh, Robert B
        title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range
         date: 2014-05-20
      journal: Virol J
          DOI: 10.1186/1743-422x-11-97
          sha: 
       doc_id: 339209
     cord_uid: oe8onyr9

         file: cache/cord-334127-wjf8t8vp.json
          key: cord-334127-wjf8t8vp
      authors: Brister, J. Rodney; Ako-adjei, Danso; Bao, Yiming; Blinkova, Olga
        title: NCBI Viral Genomes Resource
         date: 2015-01-28
      journal: Nucleic Acids Res
          DOI: 10.1093/nar/gku1207
          sha: 
       doc_id: 334127
     cord_uid: wjf8t8vp

         file: cache/cord-348427-worgd0xu.json
          key: cord-348427-worgd0xu
      authors: Hatcher, Eneida L.; Zhdanov, Sergey A.; Bao, Yiming; Blinkova, Olga; Nawrocki, Eric P.; Ostapchuck, Yuri; Schäffer, Alejandro A.; Brister, J. Rodney
        title: Virus Variation Resource – improved response to emergent viral outbreaks
         date: 2017-01-04
      journal: Nucleic Acids Res
          DOI: 10.1093/nar/gkw1065
          sha: 
       doc_id: 348427
     cord_uid: worgd0xu

         file: cache/cord-340907-j9i1wlak.json
          key: cord-340907-j9i1wlak
      authors: Zarai, Yoram; Zafrir, Zohar; Siridechadilok, Bunpote; Suphatrakul, Amporn; Roopin, Modi; Julander, Justin; Tuller, Tamir
        title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts
         date: 2020-04-27
      journal: DNA Res
          DOI: 10.1093/dnares/dsaa008
          sha: 
       doc_id: 340907
     cord_uid: j9i1wlak

         file: cache/cord-341564-fvuwick5.json
          key: cord-341564-fvuwick5
      authors: Qi, Zhao-Hui; Li, Ke-Cheng; Ma, Jin-Long; Yao, Yu-Hua; Liu, Ling-Yun
        title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application
         date: 2018-06-12
      journal: Evol Bioinform Online
          DOI: 10.1177/1176934318777755
          sha: 
       doc_id: 341564
     cord_uid: fvuwick5

         file: cache/cord-345552-h6fwi0qn.json
          key: cord-345552-h6fwi0qn
      authors: Li, Q.-G.; Lindman, K.; Wadell, G.
        title: Hydropathic characteristics of adenovirus hexons
         date: 1997-07-01
      journal: Arch Virol
          DOI: 10.1007/s007050050162
          sha: 
       doc_id: 345552
     cord_uid: h6fwi0qn

         file: cache/cord-328259-3g4klpyg.json
          key: cord-328259-3g4klpyg
      authors: Guajardo-Leiva, Sergio; Chnaiderman, Jonás; Gaggero, Aldo; Díez, Beatriz
        title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City
         date: 2020-09-21
      journal: Viruses
          DOI: 10.3390/v12091050
          sha: 
       doc_id: 328259
     cord_uid: 3g4klpyg

         file: cache/cord-330312-1pjolkql.json
          key: cord-330312-1pjolkql
      authors: Liu, Y.-T.
        title: Infectious Disease Genomics
         date: 2017-01-20
      journal: Genetics and Evolution of Infectious Diseases
          DOI: 10.1016/b978-0-12-799942-5.00010-x
          sha: 
       doc_id: 330312
     cord_uid: 1pjolkql

         file: cache/cord-354465-5nqrrnqr.json
          key: cord-354465-5nqrrnqr
      authors: Haslinger, Christian; Stadler, Peter F.
        title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties
         date: 1999
      journal: Bull Math Biol
          DOI: 10.1006/bulm.1998.0085
          sha: 
       doc_id: 354465
     cord_uid: 5nqrrnqr

         file: cache/cord-342785-55r01n0x.json
          key: cord-342785-55r01n0x
      authors: Lemmon, Gordon H; Gardner, Shea N
        title: Predicting the sensitivity and specificity of published real-time PCR assays
         date: 2008-09-25
      journal: Ann Clin Microbiol Antimicrob
          DOI: 10.1186/1476-0711-7-18
          sha: 
       doc_id: 342785
     cord_uid: 55r01n0x

         file: cache/cord-344782-ond1ziu5.json
          key: cord-344782-ond1ziu5
      authors: Zhang, Jing; Finlaison, Deborah S.; Frost, Melinda J.; Gestier, Sarah; Gu, Xingnian; Hall, Jane; Jenkins, Cheryl; Parrish, Kate; Read, Andrew J.; Srivastava, Mukesh; Rose, Karrie; Kirkland, Peter D.
        title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi)
         date: 2018-10-24
      journal: PLoS One
          DOI: 10.1371/journal.pone.0205209
          sha: 
       doc_id: 344782
     cord_uid: ond1ziu5

         file: cache/cord-339915-8j04y50s.json
          key: cord-339915-8j04y50s
      authors: Deng, Wei; Luan, Yihui
        title: DV-Curve Representation of Protein Sequences and Its Application
         date: 2014-05-08
      journal: Comput Math Methods Med
          DOI: 10.1155/2014/203871
          sha: 
       doc_id: 339915
     cord_uid: 8j04y50s

         file: cache/cord-355075-ieb35upi.json
          key: cord-355075-ieb35upi
      authors: Papenfuss, Anthony T; Baker, Michelle L; Feng, Zhi-Ping; Tachedjian, Mary; Crameri, Gary; Cowled, Chris; Ng, Justin; Janardhana, Vijaya; Field, Hume E; Wang, Lin-Fa
        title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox
         date: 2012-06-20
      journal: BMC Genomics
          DOI: 10.1186/1471-2164-13-261
          sha: 
       doc_id: 355075
     cord_uid: ieb35upi

         file: cache/cord-353290-1wi1dhv6.json
          key: cord-353290-1wi1dhv6
      authors: Kustin, Talia; Stern, Adi
        title: Biased mutation and selection in RNA viruses
         date: 2020-09-28
      journal: Mol Biol Evol
          DOI: 10.1093/molbev/msaa247
          sha: 
       doc_id: 353290
     cord_uid: 1wi1dhv6

         file: cache/cord-343863-q1y8uscj.json
          key: cord-343863-q1y8uscj
      authors: Whitney, Joe; Esteban, David J; Upton, Chris
        title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
         date: 2005-02-08
      journal: BMC Bioinformatics
          DOI: 10.1186/1471-2105-6-23
          sha: 
       doc_id: 343863
     cord_uid: q1y8uscj

         file: cache/cord-341879-vubszdp2.json
          key: cord-341879-vubszdp2
      authors: Li, Lucy M; Grassly, Nicholas C; Fraser, Christophe
        title: Genomic analysis of emerging pathogens: methods, application and future trends
         date: 2014-11-22
      journal: Genome Biol
          DOI: 10.1186/s13059-014-0541-9
          sha: 
       doc_id: 341879
     cord_uid: vubszdp2

Reading metadata file and updating bibliogrpahics
=== updating bibliographic database
Building study carrel named keyword-sequence-cord
=== file2bib.sh ===
Traceback (most recent call last):
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cord-193356-hqbstgg7'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data-disk/reader-compute/reader-cord/bin/file2bib.py", line 64, in <module>
    if ( bibliographics.loc[ escape ,'author'] ) : author = bibliographics.loc[ escape,'author']
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1272, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1389, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1965, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 625, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "/data-disk/python/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cord-193356-hqbstgg7'
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  3304 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  2864 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  4166 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  4256 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  3003 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  3868 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  5108 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  3945 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 96783 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  4500 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  1696 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  4970 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39:  4489 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint Try decreasing the value of OMP_NUM_THREADS.
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: line 39: 96250 Aborted                 $FILE2BIB "$FILE" > "$OUTPUT"
=== file2bib.sh ===
         id: cord-014674-ey29970v
     author: nan
      title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002
       date: 2003
      pages: 
  extension: .txt
        txt: ./txt/cord-014674-ey29970v.txt
      cache: ./cache/cord-014674-ey29970v.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-014674-ey29970v.txt'
=== file2bib.sh ===
         id: cord-253436-dz84icdc
     author: Wille, Michelle
      title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl
       date: 2016-03-03
      pages: 
  extension: .txt
        txt: ./txt/cord-253436-dz84icdc.txt
      cache: ./cache/cord-253436-dz84icdc.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-253436-dz84icdc.txt'
=== file2bib.sh ===
         id: cord-018459-isbc1r2o
     author: Munjal, Geetika
      title: Phylogenetics Algorithms and Applications
       date: 2018-12-10
      pages: 
  extension: .txt
        txt: ./txt/cord-018459-isbc1r2o.txt
      cache: ./cache/cord-018459-isbc1r2o.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-018459-isbc1r2o.txt'
=== file2bib.sh ===
         id: cord-001786-ybd8hi8y
     author: Dutilh, Bas E
      title: Metagenomic ventures into outer sequence space
       date: 2014-12-15
      pages: 
  extension: .txt
        txt: ./txt/cord-001786-ybd8hi8y.txt
      cache: ./cache/cord-001786-ybd8hi8y.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-001786-ybd8hi8y.txt'
=== file2bib.sh ===
         id: cord-012975-u87ol3fs
     author: Ogiwara, Atsushi
      title: Construction of a dictionary of sequence motifs that characterize groups of related proteins
       date: 1992-09-17
      pages: 
  extension: .txt
        txt: ./txt/cord-012975-u87ol3fs.txt
      cache: ./cache/cord-012975-u87ol3fs.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-012975-u87ol3fs.txt'
=== file2bib.sh ===
         id: cord-001340-kqcx7lrq
     author: Ladner, Jason T.
      title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
       date: 2014-06-17
      pages: 
  extension: .txt
        txt: ./txt/cord-001340-kqcx7lrq.txt
      cache: ./cache/cord-001340-kqcx7lrq.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-001340-kqcx7lrq.txt'
=== file2bib.sh ===
         id: cord-255194-4i9fc0r7
     author: Djikeng, Appolinaire
      title: Viral genome sequencing by random priming methods
       date: 2008-01-07
      pages: 
  extension: .txt
        txt: ./txt/cord-255194-4i9fc0r7.txt
      cache: ./cache/cord-255194-4i9fc0r7.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-255194-4i9fc0r7.txt'
=== file2bib.sh ===
         id: cord-264135-s2u76pvk
     author: Patel, Amrutlal K.
      title: Complete genome sequence analysis of chicken astrovirus isolate from India
       date: 2016-12-23
      pages: 
  extension: .txt
        txt: ./txt/cord-264135-s2u76pvk.txt
      cache: ./cache/cord-264135-s2u76pvk.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-264135-s2u76pvk.txt'
=== file2bib.sh ===
         id: cord-027316-echxuw74
     author: Modarresi, Kourosh
      title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model
       date: 2020-05-22
      pages: 
  extension: .txt
        txt: ./txt/cord-027316-echxuw74.txt
      cache: ./cache/cord-027316-echxuw74.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-027316-echxuw74.txt'
=== file2bib.sh ===
         id: cord-256278-jvfjf7aw
     author: Feng, Jie
      title: New method for comparing DNA primary sequences based on a discrimination measure
       date: 2010-10-21
      pages: 
  extension: .txt
        txt: ./txt/cord-256278-jvfjf7aw.txt
      cache: ./cache/cord-256278-jvfjf7aw.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-256278-jvfjf7aw.txt'
=== file2bib.sh ===
         id: cord-005060-n901y2d4
     author: ZHANG, Feiyun
      title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus
       date: 2001
      pages: 
  extension: .txt
        txt: ./txt/cord-005060-n901y2d4.txt
      cache: ./cache/cord-005060-n901y2d4.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-005060-n901y2d4.txt'
=== file2bib.sh ===
         id: cord-010161-bcuec2fz
     author: Matson, David O.
      title: IV, 6. Calicivirus RNA recombination
       date: 2004-09-14
      pages: 
  extension: .txt
        txt: ./txt/cord-010161-bcuec2fz.txt
      cache: ./cache/cord-010161-bcuec2fz.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-010161-bcuec2fz.txt'
=== file2bib.sh ===
         id: cord-025610-7vouj8pp
     author: Latif, Seemab
      title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
       date: 2020-05-06
      pages: 
  extension: .txt
        txt: ./txt/cord-025610-7vouj8pp.txt
      cache: ./cache/cord-025610-7vouj8pp.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-025610-7vouj8pp.txt'
=== file2bib.sh ===
         id: cord-001537-i34vmfpp
     author: Lima, Francisco Esmaile de Sales
      title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil
       date: 2015-02-17
      pages: 
  extension: .txt
        txt: ./txt/cord-001537-i34vmfpp.txt
      cache: ./cache/cord-001537-i34vmfpp.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-001537-i34vmfpp.txt'
=== file2bib.sh ===
         id: cord-023647-dlqs8ay9
     author: nan
      title: Sequences and topology
       date: 2003-03-21
      pages: 
  extension: .txt
        txt: ./txt/cord-023647-dlqs8ay9.txt
      cache: ./cache/cord-023647-dlqs8ay9.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-023647-dlqs8ay9.txt'
=== file2bib.sh ===
         id: cord-268549-2lg8i9r1
     author: Dai, Qi
      title: Sequence comparison via polar coordinates representation and curve tree
       date: 2012-01-07
      pages: 
  extension: .txt
        txt: ./txt/cord-268549-2lg8i9r1.txt
      cache: ./cache/cord-268549-2lg8i9r1.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-268549-2lg8i9r1.txt'
=== file2bib.sh ===
         id: cord-306725-0vam15pt
     author: Li, Hao
      title: First detection and genomic characteristics of bovine torovirus in dairy calves in China
       date: 2020-05-09
      pages: 
  extension: .txt
        txt: ./txt/cord-306725-0vam15pt.txt
      cache: ./cache/cord-306725-0vam15pt.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-306725-0vam15pt.txt'
=== file2bib.sh ===
         id: cord-266794-oyppubq5
     author: Zhang, Dachuan
      title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
       date: 2020-09-01
      pages: 
  extension: .txt
        txt: ./txt/cord-266794-oyppubq5.txt
      cache: ./cache/cord-266794-oyppubq5.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-266794-oyppubq5.txt'
=== file2bib.sh ===
         id: cord-000473-jpow6iw1
     author: Astrovskaya, Irina
      title: Inferring viral quasispecies spectra from 454 pyrosequencing reads
       date: 2011-07-28
      pages: 
  extension: .txt
        txt: ./txt/cord-000473-jpow6iw1.txt
      cache: ./cache/cord-000473-jpow6iw1.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-000473-jpow6iw1.txt'
=== file2bib.sh ===
         id: cord-000257-ampip7od
     author: Bagowski, Christoph P
      title: The Nature of Protein Domain Evolution: Shaping the Interaction Network
       date: 2010-08-17
      pages: 
  extension: .txt
        txt: ./txt/cord-000257-ampip7od.txt
      cache: ./cache/cord-000257-ampip7od.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-000257-ampip7od.txt'
=== file2bib.sh ===
         id: cord-321150-ev6acl7b
     author: Lam, Ha Minh
      title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
       date: 2017-10-03
      pages: 
  extension: .txt
        txt: ./txt/cord-321150-ev6acl7b.txt
      cache: ./cache/cord-321150-ev6acl7b.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-321150-ev6acl7b.txt'
=== file2bib.sh ===
         id: cord-265857-fs6dj3dp
     author: Liu, Yu-Tsueng
      title: Infectious Disease Genomics
       date: 2010-12-24
      pages: 
  extension: .txt
        txt: ./txt/cord-265857-fs6dj3dp.txt
      cache: ./cache/cord-265857-fs6dj3dp.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-265857-fs6dj3dp.txt'
=== file2bib.sh ===
         id: cord-017584-9rx4jlw8
     author: Kim, Kwangsoo
      title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data
       date: 2007
      pages: 
  extension: .txt
        txt: ./txt/cord-017584-9rx4jlw8.txt
      cache: ./cache/cord-017584-9rx4jlw8.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-017584-9rx4jlw8.txt'
=== file2bib.sh ===
         id: cord-256608-ajzk86rq
     author: van Weezep, Erik
      title: PCR diagnostics: In silico validation by an automated tool using freely available software programs
       date: 2019-05-13
      pages: 
  extension: .txt
        txt: ./txt/cord-256608-ajzk86rq.txt
      cache: ./cache/cord-256608-ajzk86rq.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-256608-ajzk86rq.txt'
=== file2bib.sh ===
         id: cord-203232-1nnqx1g9
     author: Canturk, Semih
      title: Machine-Learning Driven Drug Repurposing for COVID-19
       date: 2020-06-25
      pages: 
  extension: .txt
        txt: ./txt/cord-203232-1nnqx1g9.txt
      cache: ./cache/cord-203232-1nnqx1g9.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-203232-1nnqx1g9.txt'
=== file2bib.sh ===
         id: cord-266288-buc4dd5y
     author: Dong, Rui
      title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance
       date: 2019-04-09
      pages: 
  extension: .txt
        txt: ./txt/cord-266288-buc4dd5y.txt
      cache: ./cache/cord-266288-buc4dd5y.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-266288-buc4dd5y.txt'
=== file2bib.sh ===
         id: cord-004862-yv76yvy5
     author: Demers, G. William
      title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin
       date: 1989
      pages: 
  extension: .txt
        txt: ./txt/cord-004862-yv76yvy5.txt
      cache: ./cache/cord-004862-yv76yvy5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-004862-yv76yvy5.txt'
=== file2bib.sh ===
         id: cord-255371-o9oxchq6
     author: Nguyen, Thanh Thi
      title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
       date: 2020-07-10
      pages: 
  extension: .txt
        txt: ./txt/cord-255371-o9oxchq6.txt
      cache: ./cache/cord-255371-o9oxchq6.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-255371-o9oxchq6.txt'
=== file2bib.sh ===
         id: cord-266960-kyx6xhvj
     author: Temple, Mark D.
      title: Real-time audio and visual display of the Coronavirus genome
       date: 2020-10-02
      pages: 
  extension: .txt
        txt: ./txt/cord-266960-kyx6xhvj.txt
      cache: ./cache/cord-266960-kyx6xhvj.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-266960-kyx6xhvj.txt'
=== file2bib.sh ===
         id: cord-002473-2kpxhzbe
     author: Das, Jayanta Kumar
      title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach
       date: 2017-03-31
      pages: 
  extension: .txt
        txt: ./txt/cord-002473-2kpxhzbe.txt
      cache: ./cache/cord-002473-2kpxhzbe.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-002473-2kpxhzbe.txt'
=== file2bib.sh ===
         id: cord-311240-o0zyt2vb
     author: Motayo, Babatunde Olarenwaju
      title: Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences
       date: 2020-07-27
      pages: 
  extension: .txt
        txt: ./txt/cord-311240-o0zyt2vb.txt
      cache: ./cache/cord-311240-o0zyt2vb.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-311240-o0zyt2vb.txt'
=== file2bib.sh ===
         id: cord-213136-euv6pqh5
     author: Singh, Kulveer
      title: Sequence Effects on Internal Structure of Droplets of Associative Polymers
       date: 2020-05-17
      pages: 
  extension: .txt
        txt: ./txt/cord-213136-euv6pqh5.txt
      cache: ./cache/cord-213136-euv6pqh5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-213136-euv6pqh5.txt'
=== file2bib.sh ===
         id: cord-014461-2ubh9u8r
     author: Nelson, Oranmiyan W.
      title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
       date: 2012-10-10
      pages: 
  extension: .txt
        txt: ./txt/cord-014461-2ubh9u8r.txt
      cache: ./cache/cord-014461-2ubh9u8r.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-014461-2ubh9u8r.txt'
=== file2bib.sh ===
         id: cord-264296-0x90yubt
     author: Sawmya, Shashata
      title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond
       date: 2020-06-03
      pages: 
  extension: .txt
        txt: ./txt/cord-264296-0x90yubt.txt
      cache: ./cache/cord-264296-0x90yubt.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-264296-0x90yubt.txt'
=== file2bib.sh ===
         id: cord-010499-yefxrj30
     author: Yelverton, Elizabeth
      title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli
       date: 2006-10-27
      pages: 
  extension: .txt
        txt: ./txt/cord-010499-yefxrj30.txt
      cache: ./cache/cord-010499-yefxrj30.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-010499-yefxrj30.txt'
=== file2bib.sh ===
         id: cord-287634-64zqe4cz
     author: Al-Ssulami, Abdulrakeeb M.
      title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents
       date: 2020-01-31
      pages: 
  extension: .txt
        txt: ./txt/cord-287634-64zqe4cz.txt
      cache: ./cache/cord-287634-64zqe4cz.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-287634-64zqe4cz.txt'
=== file2bib.sh ===
         id: cord-279528-41atidai
     author: Abo-Elkhier, Mervat M.
      title: Measuring Similarity among Protein Sequences Using a New Descriptor
       date: 2019-11-22
      pages: 
  extension: .txt
        txt: ./txt/cord-279528-41atidai.txt
      cache: ./cache/cord-279528-41atidai.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-279528-41atidai.txt'
=== file2bib.sh ===
         id: cord-102766-n6mpdhyu
     author: Alam, Md. Nafis Ul
      title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses
       date: 2020-06-25
      pages: 
  extension: .txt
        txt: ./txt/cord-102766-n6mpdhyu.txt
      cache: ./cache/cord-102766-n6mpdhyu.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-102766-n6mpdhyu.txt'
=== file2bib.sh ===
         id: cord-280881-5o38ihe0
     author: Wlodawer, Alexander
      title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases
       date: 2003-11-11
      pages: 
  extension: .txt
        txt: ./txt/cord-280881-5o38ihe0.txt
      cache: ./cache/cord-280881-5o38ihe0.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-280881-5o38ihe0.txt'
=== file2bib.sh ===
         id: cord-003316-r5te5xob
     author: Balloux, Francois
      title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
       date: 2018-12-17
      pages: 
  extension: .txt
        txt: ./txt/cord-003316-r5te5xob.txt
      cache: ./cache/cord-003316-r5te5xob.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-003316-r5te5xob.txt'
=== file2bib.sh ===
         id: cord-000642-mkwpuav6
     author: Moreira, Rebeca
      title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing
       date: 2012-04-19
      pages: 
  extension: .txt
        txt: ./txt/cord-000642-mkwpuav6.txt
      cache: ./cache/cord-000642-mkwpuav6.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-000642-mkwpuav6.txt'
=== file2bib.sh ===
         id: cord-287658-c2lljdi7
     author: Lopez-Rincon, Alejandro
      title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning
       date: 2020-09-10
      pages: 
  extension: .txt
        txt: ./txt/cord-287658-c2lljdi7.txt
      cache: ./cache/cord-287658-c2lljdi7.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-287658-c2lljdi7.txt'
=== file2bib.sh ===
         id: cord-302798-q0mbngqy
     author: Ge, Junwei
      title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China
       date: 2018-06-14
      pages: 
  extension: .txt
        txt: ./txt/cord-302798-q0mbngqy.txt
      cache: ./cache/cord-302798-q0mbngqy.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-302798-q0mbngqy.txt'
=== file2bib.sh ===
         id: cord-321386-u1imic5l
     author: Li, Chun
      title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
       date: 2018-02-17
      pages: 
  extension: .txt
        txt: ./txt/cord-321386-u1imic5l.txt
      cache: ./cache/cord-321386-u1imic5l.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-321386-u1imic5l.txt'
=== file2bib.sh ===
         id: cord-274056-9t3kneoo
     author: Abd Elwahaab, Marwa A.
      title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
       date: 2019-05-08
      pages: 
  extension: .txt
        txt: ./txt/cord-274056-9t3kneoo.txt
      cache: ./cache/cord-274056-9t3kneoo.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-274056-9t3kneoo.txt'
=== file2bib.sh ===
         id: cord-193910-7p3f3znj
     author: Zhang, Xiangxie
      title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
       date: 2020-11-01
      pages: 
  extension: .txt
        txt: ./txt/cord-193910-7p3f3znj.txt
      cache: ./cache/cord-193910-7p3f3znj.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-193910-7p3f3znj.txt'
=== file2bib.sh ===
         id: cord-016798-tv2ntug6
     author: Gautam, Ablesh
      title: Bioinformatics Applications in Advancing Animal Virus Research
       date: 2019-06-06
      pages: 
  extension: .txt
        txt: ./txt/cord-016798-tv2ntug6.txt
      cache: ./cache/cord-016798-tv2ntug6.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-016798-tv2ntug6.txt'
=== file2bib.sh ===
         id: cord-252347-vnn4135b
     author: Lee, Wai-Ming
      title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants
       date: 2007-10-03
      pages: 
  extension: .txt
        txt: ./txt/cord-252347-vnn4135b.txt
      cache: ./cache/cord-252347-vnn4135b.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-252347-vnn4135b.txt'
=== file2bib.sh ===
         id: cord-001974-wjf3c7a7
     author: Friis-Nielsen, Jens
      title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
       date: 2016-02-19
      pages: 
  extension: .txt
        txt: ./txt/cord-001974-wjf3c7a7.txt
      cache: ./cache/cord-001974-wjf3c7a7.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-001974-wjf3c7a7.txt'
=== file2bib.sh ===
         id: cord-025948-6dsx7pey
     author: Maitra, Arindam
      title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
       date: 2020-06-04
      pages: 
  extension: .txt
        txt: ./txt/cord-025948-6dsx7pey.txt
      cache: ./cache/cord-025948-6dsx7pey.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-025948-6dsx7pey.txt'
=== file2bib.sh ===
         id: cord-268467-btfz6ye8
     author: Schreiber, Steven S.
      title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E
       date: 1989-03-31
      pages: 
  extension: .txt
        txt: ./txt/cord-268467-btfz6ye8.txt
      cache: ./cache/cord-268467-btfz6ye8.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-268467-btfz6ye8.txt'
=== file2bib.sh ===
         id: cord-300149-djclli8n
     author: Ruan, Yijun
      title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
       date: 2003-05-24
      pages: 
  extension: .txt
        txt: ./txt/cord-300149-djclli8n.txt
      cache: ./cache/cord-300149-djclli8n.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-300149-djclli8n.txt'
=== file2bib.sh ===
         id: cord-267500-x3u9i1vq
     author: Rose, Rebecca
      title: Challenges in the analysis of viral metagenomes
       date: 2016-08-03
      pages: 
  extension: .txt
        txt: ./txt/cord-267500-x3u9i1vq.txt
      cache: ./cache/cord-267500-x3u9i1vq.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-267500-x3u9i1vq.txt'
=== file2bib.sh ===
         id: cord-035033-osjy88rc
     author: Aydin, Berkay
      title: Spatiotemporal event sequence discovery without thresholds
       date: 2020-11-09
      pages: 
  extension: .txt
        txt: ./txt/cord-035033-osjy88rc.txt
      cache: ./cache/cord-035033-osjy88rc.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-035033-osjy88rc.txt'
=== file2bib.sh ===
         id: cord-324216-ce3wa889
     author: Wang, Zheng
      title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses
       date: 2008-12-01
      pages: 
  extension: .txt
        txt: ./txt/cord-324216-ce3wa889.txt
      cache: ./cache/cord-324216-ce3wa889.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-324216-ce3wa889.txt'
=== file2bib.sh ===
         id: cord-033010-o5kiadfm
     author: Durojaye, Olanrewaju Ayodeji
      title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study
       date: 2020-10-02
      pages: 
  extension: .txt
        txt: ./txt/cord-033010-o5kiadfm.txt
      cache: ./cache/cord-033010-o5kiadfm.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-033010-o5kiadfm.txt'
=== file2bib.sh ===
         id: cord-300796-rmjv56ia
     author: nan
      title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation
       date: 1990-09-01
      pages: 
  extension: .txt
        txt: ./txt/cord-300796-rmjv56ia.txt
      cache: ./cache/cord-300796-rmjv56ia.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-300796-rmjv56ia.txt'
=== file2bib.sh ===
         id: cord-015850-ef6svn8f
     author: Saitou, Naruya
      title: Eukaryote Genomes
       date: 2013-08-22
      pages: 
  extension: .txt
        txt: ./txt/cord-015850-ef6svn8f.txt
      cache: ./cache/cord-015850-ef6svn8f.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-015850-ef6svn8f.txt'
=== file2bib.sh ===
         id: cord-325985-xfzhn1n1
     author: Jabado, Omar J.
      title: Comprehensive viral oligonucleotide probe design using conserved protein regions
       date: 2007-12-13
      pages: 
  extension: .txt
        txt: ./txt/cord-325985-xfzhn1n1.txt
      cache: ./cache/cord-325985-xfzhn1n1.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-325985-xfzhn1n1.txt'
=== file2bib.sh ===
         id: cord-275258-azpg5yrh
     author: Mead, Dylan J.T.
      title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling
       date: 2019-07-26
      pages: 
  extension: .txt
        txt: ./txt/cord-275258-azpg5yrh.txt
      cache: ./cache/cord-275258-azpg5yrh.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-275258-azpg5yrh.txt'
=== file2bib.sh ===
         id: cord-263987-ff6kor0c
     author: Holmes, Ian H.
      title: Solving the master equation for Indels
       date: 2017-05-12
      pages: 
  extension: .txt
        txt: ./txt/cord-263987-ff6kor0c.txt
      cache: ./cache/cord-263987-ff6kor0c.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-263987-ff6kor0c.txt'
=== file2bib.sh ===
         id: cord-022494-d66rz6dc
     author: Webb, B.
      title: Comparative Modeling of Drug Target Proteins
       date: 2014-10-01
      pages: 
  extension: .txt
        txt: ./txt/cord-022494-d66rz6dc.txt
      cache: ./cache/cord-022494-d66rz6dc.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-022494-d66rz6dc.txt'
=== file2bib.sh ===
         id: cord-010273-0c56x9f5
     author: Simmonds, Peter
      title: Virology of hepatitis C virus
       date: 2001-10-10
      pages: 
  extension: .txt
        txt: ./txt/cord-010273-0c56x9f5.txt
      cache: ./cache/cord-010273-0c56x9f5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-010273-0c56x9f5.txt'
=== file2bib.sh ===
         id: cord-103029-nc5yf6x4
     author: Wichmann, Stefan
      title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank
       date: 2020-09-25
      pages: 
  extension: .txt
        txt: ./txt/cord-103029-nc5yf6x4.txt
      cache: ./cache/cord-103029-nc5yf6x4.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-103029-nc5yf6x4.txt'
=== file2bib.sh ===
         id: cord-017932-vmtjc8ct
     author: Georgiev, Vassil St.
      title: Genomic and Postgenomic Research
       date: 2009
      pages: 
  extension: .txt
        txt: ./txt/cord-017932-vmtjc8ct.txt
      cache: ./cache/cord-017932-vmtjc8ct.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-017932-vmtjc8ct.txt'
=== file2bib.sh ===
         id: cord-018133-2otxft31
     author: Altman, Russ B.
      title: Bioinformatics
       date: 2006
      pages: 
  extension: .txt
        txt: ./txt/cord-018133-2otxft31.txt
      cache: ./cache/cord-018133-2otxft31.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-018133-2otxft31.txt'
=== file2bib.sh ===
         id: cord-321715-bkfkmtld
     author: Redelings, Benjamin D
      title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens
       date: 2007-03-14
      pages: 
  extension: .txt
        txt: ./txt/cord-321715-bkfkmtld.txt
      cache: ./cache/cord-321715-bkfkmtld.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-321715-bkfkmtld.txt'
=== file2bib.sh ===
         id: cord-311839-61djk4bs
     author: Wei, Dan
      title: A novel hierarchical clustering algorithm for gene sequences
       date: 2012-07-23
      pages: 
  extension: .txt
        txt: ./txt/cord-311839-61djk4bs.txt
      cache: ./cache/cord-311839-61djk4bs.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-311839-61djk4bs.txt'
=== file2bib.sh ===
         id: cord-326225-crtpzad7
     author: Neill, John D.
      title: Simultaneous rapid sequencing of multiple RNA virus genomes
       date: 2014-06-01
      pages: 
  extension: .txt
        txt: ./txt/cord-326225-crtpzad7.txt
      cache: ./cache/cord-326225-crtpzad7.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-326225-crtpzad7.txt'
=== file2bib.sh ===
         id: cord-345552-h6fwi0qn
     author: Li, Q.-G.
      title: Hydropathic characteristics of adenovirus hexons
       date: 1997-07-01
      pages: 
  extension: .txt
        txt: ./txt/cord-345552-h6fwi0qn.txt
      cache: ./cache/cord-345552-h6fwi0qn.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-345552-h6fwi0qn.txt'
=== file2bib.sh ===
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: fork: retry: No child processes
         id: cord-330067-ujhgb3b0
     author: Huang, Yi
      title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes
       date: 2007-10-02
      pages: 
  extension: .txt
        txt: ./txt/cord-330067-ujhgb3b0.txt
      cache: ./cache/cord-330067-ujhgb3b0.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-330067-ujhgb3b0.txt'
=== file2bib.sh ===
         id: cord-264746-gfn312aa
     author: Muse, Spencer
      title: GENOMICS AND BIOINFORMATICS
       date: 2012-03-29
      pages: 
  extension: .txt
        txt: ./txt/cord-264746-gfn312aa.txt
      cache: ./cache/cord-264746-gfn312aa.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-264746-gfn312aa.txt'
=== file2bib.sh ===
         id: cord-341564-fvuwick5
     author: Qi, Zhao-Hui
      title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application
       date: 2018-06-12
      pages: 
  extension: .txt
        txt: ./txt/cord-341564-fvuwick5.txt
      cache: ./cache/cord-341564-fvuwick5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-341564-fvuwick5.txt'
=== file2bib.sh ===
         id: cord-334127-wjf8t8vp
     author: Brister, J. Rodney
      title: NCBI Viral Genomes Resource
       date: 2015-01-28
      pages: 
  extension: .txt
        txt: ./txt/cord-334127-wjf8t8vp.txt
      cache: ./cache/cord-334127-wjf8t8vp.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-334127-wjf8t8vp.txt'
=== file2bib.sh ===
         id: cord-339209-oe8onyr9
     author: Vasilakis, Nikos
      title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range
       date: 2014-05-20
      pages: 
  extension: .txt
        txt: ./txt/cord-339209-oe8onyr9.txt
      cache: ./cache/cord-339209-oe8onyr9.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-339209-oe8onyr9.txt'
=== file2bib.sh ===
         id: cord-022348-w7z97wir
     author: Sola, Monica
      title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
       date: 2007-09-02
      pages: 
  extension: .txt
        txt: ./txt/cord-022348-w7z97wir.txt
      cache: ./cache/cord-022348-w7z97wir.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-022348-w7z97wir.txt'
=== file2bib.sh ===
         id: cord-342785-55r01n0x
     author: Lemmon, Gordon H
      title: Predicting the sensitivity and specificity of published real-time PCR assays
       date: 2008-09-25
      pages: 
  extension: .txt
        txt: ./txt/cord-342785-55r01n0x.txt
      cache: ./cache/cord-342785-55r01n0x.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-342785-55r01n0x.txt'
=== file2bib.sh ===
         id: cord-016594-lj0us1dq
     author: Flower, Darren R.
      title: Identification of Candidate Vaccine Antigens In Silico
       date: 2012-09-28
      pages: 
  extension: .txt
        txt: ./txt/cord-016594-lj0us1dq.txt
      cache: ./cache/cord-016594-lj0us1dq.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-016594-lj0us1dq.txt'
=== file2bib.sh ===
         id: cord-348427-worgd0xu
     author: Hatcher, Eneida L.
      title: Virus Variation Resource – improved response to emergent viral outbreaks
       date: 2017-01-04
      pages: 
  extension: .txt
        txt: ./txt/cord-348427-worgd0xu.txt
      cache: ./cache/cord-348427-worgd0xu.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-348427-worgd0xu.txt'
=== file2bib.sh ===
         id: cord-017354-cndb031c
     author: Janies, D.
      title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases
       date: 2008
      pages: 
  extension: .txt
        txt: ./txt/cord-017354-cndb031c.txt
      cache: ./cache/cord-017354-cndb031c.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-017354-cndb031c.txt'
=== file2bib.sh ===
         id: cord-339915-8j04y50s
     author: Deng, Wei
      title: DV-Curve Representation of Protein Sequences and Its Application
       date: 2014-05-08
      pages: 
  extension: .txt
        txt: ./txt/cord-339915-8j04y50s.txt
      cache: ./cache/cord-339915-8j04y50s.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-339915-8j04y50s.txt'
=== file2bib.sh ===
         id: cord-328644-odtue60a
     author: Comandatore, Francesco
      title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes
       date: 2020-05-28
      pages: 
  extension: .txt
        txt: ./txt/cord-328644-odtue60a.txt
      cache: ./cache/cord-328644-odtue60a.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-328644-odtue60a.txt'
=== file2bib.sh ===
         id: cord-331698-rwow1ydx
     author: Latorre-Pérez, Adriel
      title: A lab in the field: applications of real-time, in situ metagenomic sequencing
       date: 2020-08-20
      pages: 
  extension: .txt
        txt: ./txt/cord-331698-rwow1ydx.txt
      cache: ./cache/cord-331698-rwow1ydx.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-331698-rwow1ydx.txt'
=== file2bib.sh ===
         id: cord-330312-1pjolkql
     author: Liu, Y.-T.
      title: Infectious Disease Genomics
       date: 2017-01-20
      pages: 
  extension: .txt
        txt: ./txt/cord-330312-1pjolkql.txt
      cache: ./cache/cord-330312-1pjolkql.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-330312-1pjolkql.txt'
=== file2bib.sh ===
         id: cord-341879-vubszdp2
     author: Li, Lucy M
      title: Genomic analysis of emerging pathogens: methods, application and future trends
       date: 2014-11-22
      pages: 
  extension: .txt
        txt: ./txt/cord-341879-vubszdp2.txt
      cache: ./cache/cord-341879-vubszdp2.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-341879-vubszdp2.txt'
=== file2bib.sh ===
/data-disk/reader-compute/reader-cord/bin/file2bib.sh: fork: retry: No child processes
         id: cord-338207-60vrlrim
     author: Lefkowitz, E.J.
      title: Virus Databases
       date: 2008-07-30
      pages: 
  extension: .txt
        txt: ./txt/cord-338207-60vrlrim.txt
      cache: ./cache/cord-338207-60vrlrim.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-338207-60vrlrim.txt'
=== file2bib.sh ===
         id: cord-343863-q1y8uscj
     author: Whitney, Joe
      title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
       date: 2005-02-08
      pages: 
  extension: .txt
        txt: ./txt/cord-343863-q1y8uscj.txt
      cache: ./cache/cord-343863-q1y8uscj.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-343863-q1y8uscj.txt'
=== file2bib.sh ===
         id: cord-011565-8ncgldaq
     author: Elworth, R A Leo
      title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
       date: 2020-06-04
      pages: 
  extension: .txt
        txt: ./txt/cord-011565-8ncgldaq.txt
      cache: ./cache/cord-011565-8ncgldaq.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-011565-8ncgldaq.txt'
=== file2bib.sh ===
         id: cord-328259-3g4klpyg
     author: Guajardo-Leiva, Sergio
      title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City
       date: 2020-09-21
      pages: 
  extension: .txt
        txt: ./txt/cord-328259-3g4klpyg.txt
      cache: ./cache/cord-328259-3g4klpyg.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-328259-3g4klpyg.txt'
=== file2bib.sh ===
         id: cord-340907-j9i1wlak
     author: Zarai, Yoram
      title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts
       date: 2020-04-27
      pages: 
  extension: .txt
        txt: ./txt/cord-340907-j9i1wlak.txt
      cache: ./cache/cord-340907-j9i1wlak.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-340907-j9i1wlak.txt'
=== file2bib.sh ===
         id: cord-103297-4stnx8dw
     author: Widrich, Michael
      title: Modern Hopfield Networks and Attention for Immune Repertoire Classification
       date: 2020-08-17
      pages: 
  extension: .txt
        txt: ./txt/cord-103297-4stnx8dw.txt
      cache: ./cache/cord-103297-4stnx8dw.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-103297-4stnx8dw.txt'
=== file2bib.sh ===
         id: cord-334394-qgyzk7th
     author: Edgar, Robert C.
      title: Petabase-scale sequence alignment catalyses viral discovery
       date: 2020-08-10
      pages: 
  extension: .txt
        txt: ./txt/cord-334394-qgyzk7th.txt
      cache: ./cache/cord-334394-qgyzk7th.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	5
resourceName	b'cord-334394-qgyzk7th.txt'
=== file2bib.sh ===
         id: cord-018963-2lia97db
     author: Xu, Ying
      title: Protein Structure Prediction by Protein Threading
       date: 2010-04-29
      pages: 
  extension: .txt
        txt: ./txt/cord-018963-2lia97db.txt
      cache: ./cache/cord-018963-2lia97db.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-018963-2lia97db.txt'
=== file2bib.sh ===
         id: cord-344782-ond1ziu5
     author: Zhang, Jing
      title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi)
       date: 2018-10-24
      pages: 
  extension: .txt
        txt: ./txt/cord-344782-ond1ziu5.txt
      cache: ./cache/cord-344782-ond1ziu5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-344782-ond1ziu5.txt'
=== file2bib.sh ===
         id: cord-353290-1wi1dhv6
     author: Kustin, Talia
      title: Biased mutation and selection in RNA viruses
       date: 2020-09-28
      pages: 
  extension: .txt
        txt: ./txt/cord-353290-1wi1dhv6.txt
      cache: ./cache/cord-353290-1wi1dhv6.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	2
resourceName	b'cord-353290-1wi1dhv6.txt'
=== file2bib.sh ===
         id: cord-355075-ieb35upi
     author: Papenfuss, Anthony T
      title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox
       date: 2012-06-20
      pages: 
  extension: .txt
        txt: ./txt/cord-355075-ieb35upi.txt
      cache: ./cache/cord-355075-ieb35upi.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-355075-ieb35upi.txt'
=== file2bib.sh ===
         id: cord-354465-5nqrrnqr
     author: Haslinger, Christian
      title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties
       date: 1999
      pages: 
  extension: .txt
        txt: ./txt/cord-354465-5nqrrnqr.txt
      cache: ./cache/cord-354465-5nqrrnqr.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-354465-5nqrrnqr.txt'
=== file2bib.sh ===
         id: cord-304869-l6a68tqn
     author: Bielińska-Wąż, Dorota
      title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity
       date: 2011-08-28
      pages: 
  extension: .txt
        txt: ./txt/cord-304869-l6a68tqn.txt
      cache: ./cache/cord-304869-l6a68tqn.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-304869-l6a68tqn.txt'
=== file2bib.sh ===
         id: cord-301827-a7hnuxy5
     author: Uversky, Vladimir N
      title: A decade and a half of protein intrinsic disorder: Biology still waits for physics
       date: 2013-04-29
      pages: 
  extension: .txt
        txt: ./txt/cord-301827-a7hnuxy5.txt
      cache: ./cache/cord-301827-a7hnuxy5.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	3
resourceName	b'cord-301827-a7hnuxy5.txt'
=== file2bib.sh ===
         id: cord-014462-11ggaqf1
     author: nan
      title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh
       date: 2011-04-21
      pages: 
  extension: .txt
        txt: ./txt/cord-014462-11ggaqf1.txt
      cache: ./cache/cord-014462-11ggaqf1.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	5
resourceName	b'cord-014462-11ggaqf1.txt'
=== file2bib.sh ===
         id: cord-023208-w99gc5nx
     author: nan
      title: Poster Presentation Abstracts
       date: 2006-09-01
      pages: 
  extension: .txt
        txt: ./txt/cord-023208-w99gc5nx.txt
      cache: ./cache/cord-023208-w99gc5nx.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	6
resourceName	b'cord-023208-w99gc5nx.txt'
=== file2bib.sh ===
         id: cord-004879-pgyzluwp
     author: nan
      title: Programmed cell death
       date: 1994
      pages: 
  extension: .txt
        txt: ./txt/cord-004879-pgyzluwp.txt
      cache: ./cache/cord-004879-pgyzluwp.txt

Content-Encoding	ISO-8859-1
Content-Type	text/plain; charset=ISO-8859-1
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	4
resourceName	b'cord-004879-pgyzluwp.txt'
=== file2bib.sh ===
         id: cord-023209-un2ysc2v
     author: nan
      title: Poster Presentations
       date: 2008-10-07
      pages: 
  extension: .txt
        txt: ./txt/cord-023209-un2ysc2v.txt
      cache: ./cache/cord-023209-un2ysc2v.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	8
resourceName	b'cord-023209-un2ysc2v.txt'
=== file2bib.sh ===
         id: cord-001835-0s7ok4uw
     author: nan
      title: Abstracts of the 29th Annual Symposium of The Protein Society
       date: 2015-10-01
      pages: 
  extension: .txt
        txt: ./txt/cord-001835-0s7ok4uw.txt
      cache: ./cache/cord-001835-0s7ok4uw.txt

Content-Encoding	UTF-8
Content-Type	text/plain; charset=UTF-8
X-Parsed-By	['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler	ToTextContentHandler
X-TIKA:embedded_depth	0
X-TIKA:parse_time_millis	10
resourceName	b'cord-001835-0s7ok4uw.txt'
Que is empty; done
keyword-sequence-cord
=== reduce.pl bib ===
         id = cord-000257-ampip7od
     author = Bagowski, Christoph P
      title = The Nature of Protein Domain Evolution: Shaping the Interaction Network
       date = 2010-08-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 4678
  sentences = 249
     flesch = 43
    summary = With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig.
      cache = ./cache/cord-000257-ampip7od.txt
       txt  = ./txt/cord-000257-ampip7od.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-016798-tv2ntug6
     author = Gautam, Ablesh
      title = Bioinformatics Applications in Advancing Animal Virus Research
       date = 2019-06-06
      pages = 
  extension = .txt
       mime = text/plain
      words = 6978
  sentences = 405
     flesch = 44
    summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well virus-specific bioinformatics tools that the researches can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise of new CDSs that are different than previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
      cache = ./cache/cord-016798-tv2ntug6.txt
       txt  = ./txt/cord-016798-tv2ntug6.txt
=== reduce.pl bib ===
         id = cord-000473-jpow6iw1
     author = Astrovskaya, Irina
      title = Inferring viral quasispecies spectra from 454 pyrosequencing reads
       date = 2011-07-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 5363
  sentences = 296
     flesch = 54
    summary = High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population.
      cache = ./cache/cord-000473-jpow6iw1.txt
       txt  = ./txt/cord-000473-jpow6iw1.txt
=== reduce.pl bib ===
         id = cord-025610-7vouj8pp
     author = Latif, Seemab
      title = Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
       date = 2020-05-06
      pages = 
  extension = .txt
       mime = text/plain
      words = 3923
  sentences = 230
     flesch = 50
    summary = In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13] . Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method to address the issue of lexical constrained sequence generation.
      cache = ./cache/cord-025610-7vouj8pp.txt
       txt  = ./txt/cord-025610-7vouj8pp.txt
=== reduce.pl bib ===
         id = cord-004862-yv76yvy5
     author = Demers, G. William
      title = The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin
       date = 1989
      pages = 
  extension = .txt
       mime = text/plain
      words = 6659
  sentences = 347
     flesch = 62
    summary = title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species.
      cache = ./cache/cord-004862-yv76yvy5.txt
       txt  = ./txt/cord-004862-yv76yvy5.txt
=== reduce.pl bib ===
         id = cord-025948-6dsx7pey
     author = Maitra, Arindam
      title = Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
       date = 2020-06-04
      pages = 
  extension = .txt
       mime = text/plain
      words = 7218
  sentences = 382
     flesch = 56
    summary = Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRP and nucleocapsid coding genes.
      cache = ./cache/cord-025948-6dsx7pey.txt
       txt  = ./txt/cord-025948-6dsx7pey.txt
=== reduce.pl bib ===
         id = cord-014674-ey29970v
     author = nan
      title = Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002
       date = 2003
      pages = 
  extension = .txt
       mime = text/plain
      words = 2522
  sentences = 181
     flesch = 62
    summary = title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report.We find that aside from problematic details of the experimental design and some erratic presentations of the data the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of 'criollo' maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig.2 as regions flanking the CMV p-35S sequence.We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites.
      cache = ./cache/cord-014674-ey29970v.txt
       txt  = ./txt/cord-014674-ey29970v.txt
=== reduce.pl bib ===
         id = cord-015850-ef6svn8f
     author = Saitou, Naruya
      title = Eukaryote Genomes
       date = 2013-08-22
      pages = 
  extension = .txt
       mime = text/plain
      words = 7424
  sentences = 484
     flesch = 53
    summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specifi c to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffl e ( Tuber melanosporum ), shown as A i n Fig. 8.9 , has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [ 81 ] .
      cache = ./cache/cord-015850-ef6svn8f.txt
       txt  = ./txt/cord-015850-ef6svn8f.txt
=== reduce.pl bib ===
         id = cord-018459-isbc1r2o
     author = Munjal, Geetika
      title = Phylogenetics Algorithms and Applications
       date = 2018-12-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 1851
  sentences = 122
     flesch = 42
    summary = This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . Constructing phylogenetic trees using multiple sequence alignment
      cache = ./cache/cord-018459-isbc1r2o.txt
       txt  = ./txt/cord-018459-isbc1r2o.txt
=== reduce.pl bib ===
         id = cord-033010-o5kiadfm
     author = Durojaye, Olanrewaju Ayodeji
      title = Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study
       date = 2020-10-02
      pages = 
  extension = .txt
       mime = text/plain
      words = 8125
  sentences = 375
     flesch = 53
    summary = RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively, occupying 45 and 47% of the total residues with the percentage loop occupancy at 8% regarded as comparative modeling, constructs atomic models based on known structures or structures that have been determined experimentally and likewise share more than 40% sequence homology.
      cache = ./cache/cord-033010-o5kiadfm.txt
       txt  = ./txt/cord-033010-o5kiadfm.txt
=== reduce.pl bib ===
         id = cord-012975-u87ol3fs
     author = Ogiwara, Atsushi
      title = Construction of a dictionary of sequence motifs that characterize groups of related proteins
       date = 1992-09-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 3112
  sentences = 165
     flesch = 55
    summary = An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites.
      cache = ./cache/cord-012975-u87ol3fs.txt
       txt  = ./txt/cord-012975-u87ol3fs.txt
=== reduce.pl bib ===
         id = cord-103029-nc5yf6x4
     author = Wichmann, Stefan
      title = Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank
       date = 2020-09-25
      pages = 
  extension = .txt
       mime = text/plain
      words = 8665
  sentences = 387
     flesch = 52
    summary = In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occuring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here.
      cache = ./cache/cord-103029-nc5yf6x4.txt
       txt  = ./txt/cord-103029-nc5yf6x4.txt
=== reduce.pl bib ===
         id = cord-256608-ajzk86rq
     author = van Weezep, Erik
      title = PCR diagnostics: In silico validation by an automated tool using freely available software programs
       date = 2019-05-13
      pages = 
  extension = .txt
       mime = text/plain
      words = 4950
  sentences = 258
     flesch = 54
    summary = An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017; . The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001) .
      cache = ./cache/cord-256608-ajzk86rq.txt
       txt  = ./txt/cord-256608-ajzk86rq.txt
=== reduce.pl bib ===
         id = cord-001340-kqcx7lrq
     author = Ladner, Jason T.
      title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
       date = 2014-06-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 2512
  sentences = 121
     flesch = 40
    summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/ DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization.
      cache = ./cache/cord-001340-kqcx7lrq.txt
       txt  = ./txt/cord-001340-kqcx7lrq.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-017584-9rx4jlw8
     author = Kim, Kwangsoo
      title = Selecting Genotyping Oligo Probes Via Logical Analysis of Data
       date = 2007
      pages = 
  extension = .txt
       mime = text/plain
      words = 3665
  sentences = 216
     flesch = 57
    summary = Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Lost Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 with using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites.
      cache = ./cache/cord-017584-9rx4jlw8.txt
       txt  = ./txt/cord-017584-9rx4jlw8.txt
=== reduce.pl bib ===
         id = cord-002473-2kpxhzbe
     author = Das, Jayanta Kumar
      title = Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach
       date = 2017-03-31
      pages = 
  extension = .txt
       mime = text/plain
      words = 4613
  sentences = 285
     flesch = 61
    summary = Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups.
      cache = ./cache/cord-002473-2kpxhzbe.txt
       txt  = ./txt/cord-002473-2kpxhzbe.txt
=== reduce.pl bib ===
         id = cord-010161-bcuec2fz
     author = Matson, David O.
      title = IV, 6. Calicivirus RNA recombination
       date = 2004-09-14
      pages = 
  extension = .txt
       mime = text/plain
      words = 3335
  sentences = 168
     flesch = 45
    summary = With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2) . While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5' end of that strain's capsid gene (ID="B" sequence for this Fig.) . The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades.
      cache = ./cache/cord-010161-bcuec2fz.txt
       txt  = ./txt/cord-010161-bcuec2fz.txt
=== reduce.pl bib ===
         id = cord-011565-8ncgldaq
     author = Elworth, R A Leo
      title = To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
       date = 2020-06-04
      pages = 
  extension = .txt
       mime = text/plain
      words = 12960
  sentences = 717
     flesch = 53
    summary = For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7) , Count-Min Sketch (CMS) (8) , HyperLogLog (9) and Bloom filters (10) . A more in depth discussion of many of these topics can also be found in (3, 4) includes a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13) , with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68) . One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of localitysensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses.
      cache = ./cache/cord-011565-8ncgldaq.txt
       txt  = ./txt/cord-011565-8ncgldaq.txt
=== reduce.pl bib ===
         id = cord-005060-n901y2d4
     author = ZHANG, Feiyun
      title = Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus
       date = 2001
      pages = 
  extension = .txt
       mime = text/plain
      words = 2602
  sentences = 173
     flesch = 62
    summary = The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobernoviruses has been determined in Southern bean mosaic virus (SBMV)'2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the con-served sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region : proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3) . In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8).
      cache = ./cache/cord-005060-n901y2d4.txt
       txt  = ./txt/cord-005060-n901y2d4.txt
=== reduce.pl bib ===
         id = cord-001537-i34vmfpp
     author = Lima, Francisco Esmaile de Sales
      title = Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil
       date = 2015-02-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 3874
  sentences = 195
     flesch = 53
    summary = The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C ). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2) .
      cache = ./cache/cord-001537-i34vmfpp.txt
       txt  = ./txt/cord-001537-i34vmfpp.txt
=== reduce.pl bib ===
         id = cord-256278-jvfjf7aw
     author = Feng, Jie
      title = New method for comparing DNA primary sequences based on a discrimination measure
       date = 2010-10-21
      pages = 
  extension = .txt
       mime = text/plain
      words = 2864
  sentences = 186
     flesch = 53
    summary = title: New method for comparing DNA primary sequences based on a discrimination measure Three years after, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2 , we present the similarity/dissimilarity matrix for the full DNA sequences of bÀglobin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 bÀglobin gene sequences based on the distance matrix DM, using NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences. Analysis of similarity/ dissimilarity of DNA sequences based on novel 2-D graphical representation A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words
      cache = ./cache/cord-256278-jvfjf7aw.txt
       txt  = ./txt/cord-256278-jvfjf7aw.txt
=== reduce.pl bib ===
         id = cord-103297-4stnx8dw
     author = Widrich, Michael
      title = Modern Hopfield Networks and Attention for Immune Repertoire Classification
       date = 2020-08-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 14093
  sentences = 926
     flesch = 57
    summary = In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016 Demircigil et al., 2017) have an update rule that is known as the attention mechanisms in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results") Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule
      cache = ./cache/cord-103297-4stnx8dw.txt
       txt  = ./txt/cord-103297-4stnx8dw.txt
=== reduce.pl bib ===
         id = cord-000642-mkwpuav6
     author = Moreira, Rebeca
      title = Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing
       date = 2012-04-19
      pages = 
  extension = .txt
       mime = text/plain
      words = 6848
  sentences = 372
     flesch = 45
    summary = title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18] . philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae).
      cache = ./cache/cord-000642-mkwpuav6.txt
       txt  = ./txt/cord-000642-mkwpuav6.txt
=== reduce.pl bib ===
         id = cord-255194-4i9fc0r7
     author = Djikeng, Appolinaire
      title = Viral genome sequencing by random priming methods
       date = 2008-01-07
      pages = 
  extension = .txt
       mime = text/plain
      words = 3776
  sentences = 207
     flesch = 51
    summary = An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3' end ( Figure 2 ). Additionally, in order to capture 5' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the Klenow reaction (Figure 2 shows a 5' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10 6 viral particles per RT-PCR reaction ( Figure 5 ). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives.
      cache = ./cache/cord-255194-4i9fc0r7.txt
       txt  = ./txt/cord-255194-4i9fc0r7.txt
=== reduce.pl bib ===
         id = cord-023647-dlqs8ay9
     author = nan
      title = Sequences and topology
       date = 2003-03-21
      pages = 
  extension = .txt
       mime = text/plain
      words = 4505
  sentences = 747
     flesch = 69
    summary = Nucleotide Sequence Analysis of the L G~ne of Vesicular Stomafltia Virus (New Jersey Serotype) --Identification of Conserved Domai~L~ in L Proteins of Nonsegmented Negative-Strand RNA Viruses DERSE I~ Equine Infectious Anemia Virus tat--Insights into the Structure, Function, and Evolution of Lentivtrus tran.~Activator Proteins Ho~tu~ ~ s71 is a Ehylngcueticellly Distinct Human Endogenous Reteovtgal 1Rlement with Structural mad Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Fercedoxins from Rhodobacter-Capsulstus -Complete Amino Acid Sequences and Molecular Evolution Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Amon~ the ~x-subunits and Molecular ~ The Predicted Amino Acid Sequence of ct-lnternexin is that of a novel Neuronal lntegmedla~ ~ent Protein Inttaspecific Evolution of a Gene Family Coding for Urinary Proteins Attalysi~ of CDNA for Human ~ AJudgyrin I~dicltes a Repeated Structure with Homology to Tissue-Differentiation a~td Cell-Cycle Control Protein
      cache = ./cache/cord-023647-dlqs8ay9.txt
       txt  = ./txt/cord-023647-dlqs8ay9.txt
=== reduce.pl bib ===
         id = cord-016594-lj0us1dq
     author = Flower, Darren R.
      title = Identification of Candidate Vaccine Antigens In Silico
       date = 2012-09-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 12570
  sentences = 653
     flesch = 37
    summary = In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequence, progresses through an extensive computational stage, ultimately to deliver a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins, analysis in a series of in-vitro and in-vivo assays, animal models and with the ultimate objective of isolating one or two proteins displaying protective immunity.
      cache = ./cache/cord-016594-lj0us1dq.txt
       txt  = ./txt/cord-016594-lj0us1dq.txt
=== reduce.pl bib ===
         id = cord-022348-w7z97wir
     author = Sola, Monica
      title = Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
       date = 2007-09-02
      pages = 
  extension = .txt
       mime = text/plain
      words = 10892
  sentences = 671
     flesch = 56
    summary = An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6 .1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6 .1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity.
      cache = ./cache/cord-022348-w7z97wir.txt
       txt  = ./txt/cord-022348-w7z97wir.txt
=== reduce.pl bib ===
         id = cord-264296-0x90yubt
     author = Sawmya, Shashata
      title = Analyzing hCov genome sequences: Applying Machine Intelligence and beyond
       date = 2020-06-03
      pages = 
  extension = .txt
       mime = text/plain
      words = 5008
  sentences = 312
     flesch = 60
    summary = We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the sars-cov-2 genome sequence followed by further analyses of the same on several South-Asian countries. D. Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome.
      cache = ./cache/cord-264296-0x90yubt.txt
       txt  = ./txt/cord-264296-0x90yubt.txt
=== reduce.pl bib ===
         id = cord-264135-s2u76pvk
     author = Patel, Amrutlal K.
      title = Complete genome sequence analysis of chicken astrovirus isolate from India
       date = 2016-12-23
      pages = 
  extension = .txt
       mime = text/plain
      words = 3755
  sentences = 217
     flesch = 49
    summary = Phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2) . B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate A total of 9-10 epitopes were predicted using SVMTriP using the capsid protein sequence of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of the CAstV/ INDIA/ANAND/2016 nearest to that of CastV/4175 and CAstV/GA2011 and all four chicken astrovirus formed separate cluster except capsid protein of the CAstV/Poland/G059/ 2014 isolate which was clustered along with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with
      cache = ./cache/cord-264135-s2u76pvk.txt
       txt  = ./txt/cord-264135-s2u76pvk.txt
=== reduce.pl bib ===
         id = cord-203232-1nnqx1g9
     author = Canturk, Semih
      title = Machine-Learning Driven Drug Repurposing for COVID-19
       date = 2020-06-25
      pages = 
  extension = .txt
       mime = text/plain
      words = 5023
  sentences = 257
     flesch = 52
    summary = Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence.
      cache = ./cache/cord-203232-1nnqx1g9.txt
       txt  = ./txt/cord-203232-1nnqx1g9.txt
=== reduce.pl bib ===
         id = cord-035033-osjy88rc
     author = Aydin, Berkay
      title = Spatiotemporal event sequence discovery without thresholds
       date = 2020-11-09
      pages = 
  extension = .txt
       mime = text/plain
      words = 8231
  sentences = 430
     flesch = 54
    summary = Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which will take a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials.
      cache = ./cache/cord-035033-osjy88rc.txt
       txt  = ./txt/cord-035033-osjy88rc.txt
=== reduce.pl bib ===
         id = cord-266288-buc4dd5y
     author = Dong, Rui
      title = A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance
       date = 2019-04-09
      pages = 
  extension = .txt
       mime = text/plain
      words = 5247
  sentences = 300
     flesch = 61
    summary = Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016) , however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in R 18 , where the additional six dimensions describe the covariance between nucleotides.
      cache = ./cache/cord-266288-buc4dd5y.txt
       txt  = ./txt/cord-266288-buc4dd5y.txt
=== reduce.pl bib ===
         id = cord-266960-kyx6xhvj
     author = Temple, Mark D.
      title = Real-time audio and visual display of the Coronavirus genome
       date = 2020-10-02
      pages = 
  extension = .txt
       mime = text/plain
      words = 6780
  sentences = 360
     flesch = 56
    summary = The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORF's, these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example 'Sonification Sub-genomic RNA' the auditory display represents the process of transcription.
      cache = ./cache/cord-266960-kyx6xhvj.txt
       txt  = ./txt/cord-266960-kyx6xhvj.txt
=== reduce.pl bib ===
         id = cord-018133-2otxft31
     author = Altman, Russ B.
      title = Bioinformatics
       date = 2006
      pages = 
  extension = .txt
       mime = text/plain
      words = 9592
  sentences = 462
     flesch = 46
    summary = Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biological Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantages of either explicit or computed links between these data resources.
      cache = ./cache/cord-018133-2otxft31.txt
       txt  = ./txt/cord-018133-2otxft31.txt
=== reduce.pl bib ===
         id = cord-001786-ybd8hi8y
     author = Dutilh, Bas E
      title = Metagenomic ventures into outer sequence space
       date = 2014-12-15
      pages = 
  extension = .txt
       mime = text/plain
      words = 2193
  sentences = 121
     flesch = 44
    summary = These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database.
      cache = ./cache/cord-001786-ybd8hi8y.txt
       txt  = ./txt/cord-001786-ybd8hi8y.txt
=== reduce.pl bib ===
         id = cord-003316-r5te5xob
     author = Balloux, Francois
      title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
       date = 2018-12-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 7340
  sentences = 327
     flesch = 34
    summary = WGS-based strain identification gives a far superior resolution In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
      cache = ./cache/cord-003316-r5te5xob.txt
       txt  = ./txt/cord-003316-r5te5xob.txt
=== reduce.pl bib ===
         id = cord-300796-rmjv56ia
     author = nan
      title = The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation
       date = 1990-09-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 8031
  sentences = 405
     flesch = 57
    summary = In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40 residues long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980 ; references on dhfr sequence in legend to Fig. 1) , Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain.
      cache = ./cache/cord-300796-rmjv56ia.txt
       txt  = ./txt/cord-300796-rmjv56ia.txt
=== reduce.pl bib ===
         id = cord-017932-vmtjc8ct
     author = Georgiev, Vassil St.
      title = Genomic and Postgenomic Research
       date = 2009
      pages = 
  extension = .txt
       mime = text/plain
      words = 8476
  sentences = 360
     flesch = 36
    summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
      cache = ./cache/cord-017932-vmtjc8ct.txt
       txt  = ./txt/cord-017932-vmtjc8ct.txt
=== reduce.pl bib ===
         id = cord-265857-fs6dj3dp
     author = Liu, Yu-Tsueng
      title = Infectious Disease Genomics
       date = 2010-12-24
      pages = 
  extension = .txt
       mime = text/plain
      words = 4341
  sentences = 233
     flesch = 45
    summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002) . Genome sequencing projects for other important human disease vectors are in progress Megy et al., 2009 ). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10 .1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
      cache = ./cache/cord-265857-fs6dj3dp.txt
       txt  = ./txt/cord-265857-fs6dj3dp.txt
=== reduce.pl bib ===
         id = cord-010273-0c56x9f5
     author = Simmonds, Peter
      title = Virology of hepatitis C virus
       date = 2001-10-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 7897
  sentences = 337
     flesch = 41
    summary = 1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complimentary to the HCV genome. 6 '13 Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5'NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients, ll5 Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection.
      cache = ./cache/cord-010273-0c56x9f5.txt
       txt  = ./txt/cord-010273-0c56x9f5.txt
=== reduce.pl bib ===
         id = cord-010499-yefxrj30
     author = Yelverton, Elizabeth
      title = The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli
       date = 2006-10-27
      pages = 
  extension = .txt
       mime = text/plain
      words = 5883
  sentences = 330
     flesch = 60
    summary = Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyi-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gailant, 1983; 1986; Gallant et ai, 1985; Kurland and Gallant, 1986) . Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rllB gene of phage T4, Weiss and Gallant (1986) found that oniy a minority were phenotypicaily suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context njies governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsiey, 1992; Peter et ai. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site.
      cache = ./cache/cord-010499-yefxrj30.txt
       txt  = ./txt/cord-010499-yefxrj30.txt
=== reduce.pl bib ===
         id = cord-263987-ff6kor0c
     author = Holmes, Ian H.
      title = Solving the master equation for Indels
       date = 2017-05-12
      pages = 
  extension = .txt
       mime = text/plain
      words = 7131
  sentences = 357
     flesch = 44
    summary = BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
      cache = ./cache/cord-263987-ff6kor0c.txt
       txt  = ./txt/cord-263987-ff6kor0c.txt
=== reduce.pl bib ===
         id = cord-022494-d66rz6dc
     author = Webb, B.
      title = Comparative Modeling of Drug Target Proteins
       date = 2014-10-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 8782
  sentences = 453
     flesch = 47
    summary = Comparative modeling consists of four main steps 23 (Figure 2 (a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homologyderived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition ( Figure 2(b) ).
      cache = ./cache/cord-022494-d66rz6dc.txt
       txt  = ./txt/cord-022494-d66rz6dc.txt
=== reduce.pl bib ===
         id = cord-253436-dz84icdc
     author = Wille, Michelle
      title = High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl
       date = 2016-03-03
      pages = 
  extension = .txt
       mime = text/plain
      words = 2019
  sentences = 103
     flesch = 54
    summary = In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the sample species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small samples sizes and differences in prevalence, what is clear, is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30] , and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31] .
      cache = ./cache/cord-253436-dz84icdc.txt
       txt  = ./txt/cord-253436-dz84icdc.txt
=== reduce.pl bib ===
         id = cord-193910-7p3f3znj
     author = Zhang, Xiangxie
      title = Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
       date = 2020-11-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 7724
  sentences = 436
     flesch = 59
    summary = In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a con-volutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table ( 8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken.
      cache = ./cache/cord-193910-7p3f3znj.txt
       txt  = ./txt/cord-193910-7p3f3znj.txt
=== reduce.pl bib ===
         id = cord-017354-cndb031c
     author = Janies, D.
      title = Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases
       date = 2008
      pages = 
  extension = .txt
       mime = text/plain
      words = 12429
  sentences = 648
     flesch = 45
    summary = The products of a phylogenetic analysis are a graphical tree of ancestor-descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor or descendent. Phylogenetic analysis of large genomic datasets can present several nested NPcomplete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92] . Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events the virus among hosts.
      cache = ./cache/cord-017354-cndb031c.txt
       txt  = ./txt/cord-017354-cndb031c.txt
=== reduce.pl bib ===
         id = cord-255371-o9oxchq6
     author = Nguyen, Thanh Thi
      title = Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
       date = 2020-07-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 5640
  sentences = 365
     flesch = 59
    summary = title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13] . By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics.
      cache = ./cache/cord-255371-o9oxchq6.txt
       txt  = ./txt/cord-255371-o9oxchq6.txt
=== reduce.pl bib ===
         id = cord-014462-11ggaqf1
     author = nan
      title = Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh
       date = 2011-04-21
      pages = 
  extension = .txt
       mime = text/plain
      words = 35453
  sentences = 1711
     flesch = 49
    summary = Molecular diagnosis based on reverse transcription (RT)-PCR s.a. one step or nested PCR, nucleic acid sequence based amplification (NASBA), or real time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that NS1 antigen based ELISA test can be an useful tool to detect the dengue virus infection in patients during the early acute phase of disease since appearance of IgM antibodies usually occur after fifth day of the infection. The studies showed high level of expression in case of constructed vector as compared to infected virus for the specific protein.
      cache = ./cache/cord-014462-11ggaqf1.txt
       txt  = ./txt/cord-014462-11ggaqf1.txt
=== reduce.pl bib ===
         id = cord-014461-2ubh9u8r
     author = Nelson, Oranmiyan W.
      title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
       date = 2012-10-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 4124
  sentences = 454
     flesch = 44
    summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042
      cache = ./cache/cord-014461-2ubh9u8r.txt
       txt  = ./txt/cord-014461-2ubh9u8r.txt
=== reduce.pl bib ===
         id = cord-268549-2lg8i9r1
     author = Dai, Qi
      title = Sequence comparison via polar coordinates representation and curve tree
       date = 2012-01-07
      pages = 
  extension = .txt
       mime = text/plain
      words = 4360
  sentences = 272
     flesch = 59
    summary = It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation
      cache = ./cache/cord-268549-2lg8i9r1.txt
       txt  = ./txt/cord-268549-2lg8i9r1.txt
=== reduce.pl bib ===
         id = cord-001974-wjf3c7a7
     author = Friis-Nielsen, Jens
      title = Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
       date = 2016-02-19
      pages = 
  extension = .txt
       mime = text/plain
      words = 5773
  sentences = 348
     flesch = 48
    summary = Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP.
      cache = ./cache/cord-001974-wjf3c7a7.txt
       txt  = ./txt/cord-001974-wjf3c7a7.txt
=== reduce.pl bib ===
         id = cord-275258-azpg5yrh
     author = Mead, Dylan J.T.
      title = Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling
       date = 2019-07-26
      pages = 
  extension = .txt
       mime = text/plain
      words = 6333
  sentences = 346
     flesch = 53
    summary = title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level.
      cache = ./cache/cord-275258-azpg5yrh.txt
       txt  = ./txt/cord-275258-azpg5yrh.txt
=== reduce.pl bib ===
         id = cord-023208-w99gc5nx
     author = nan
      title = Poster Presentation Abstracts
       date = 2006-09-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 70854
  sentences = 3492
     flesch = 43
    summary = In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitinium is a well known mechanism in protein degredation of Eukaryotic cells ,in which many obsolte and corrupted three dimentional structure protein ,become marked by covalent attachment of ubuquitin through a multi-step enzymatic pathway.Ubiquitin is a small ,8.5 kDa peptide of 76 amino acid residues that targets such substrtes for proteolysis in proteasome .Recnt studies showed that an extra cellular ubiquitination process also taking place in the epididymes of humans and other animals marks protein on the surface of the defective sperm .it appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. This head-to-tailcyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity.As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity.
      cache = ./cache/cord-023208-w99gc5nx.txt
       txt  = ./txt/cord-023208-w99gc5nx.txt
=== reduce.pl bib ===
         id = cord-321386-u1imic5l
     author = Li, Chun
      title = Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
       date = 2018-02-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 5503
  sentences = 311
     flesch = 59
    summary = METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop a SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a dimensional feature vector can be constructed to numerically characterize a protein sequence: , By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition
      cache = ./cache/cord-321386-u1imic5l.txt
       txt  = ./txt/cord-321386-u1imic5l.txt
=== reduce.pl bib ===
         id = cord-306725-0vam15pt
     author = Li, Hao
      title = First detection and genomic characteristics of bovine torovirus in dairy calves in China
       date = 2020-05-09
      pages = 
  extension = .txt
       mime = text/plain
      words = 3015
  sentences = 156
     flesch = 58
    summary = Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2) , designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV /SC-2/China investigated in this study are indicated by black triangles Fig. 2 Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins.
      cache = ./cache/cord-306725-0vam15pt.txt
       txt  = ./txt/cord-306725-0vam15pt.txt
=== reduce.pl bib ===
         id = cord-027316-echxuw74
     author = Modarresi, Kourosh
      title = Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model
       date = 2020-05-22
      pages = 
  extension = .txt
       mime = text/plain
      words = 2116
  sentences = 148
     flesch = 49
    summary = This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94] . Though, modified version of RNN like LSTM and GRU have been improvement over RNN (recurrent neural networks) in dealing with vanishing gradients and long-term memory loss, still they suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is, α Embedding Error 2 (6) Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation
      cache = ./cache/cord-027316-echxuw74.txt
       txt  = ./txt/cord-027316-echxuw74.txt
=== reduce.pl bib ===
         id = cord-213136-euv6pqh5
     author = Singh, Kulveer
      title = Sequence Effects on Internal Structure of Droplets of Associative Polymers
       date = 2020-05-17
      pages = 
  extension = .txt
       mime = text/plain
      words = 4329
  sentences = 184
     flesch = 56
    summary = We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers).
      cache = ./cache/cord-213136-euv6pqh5.txt
       txt  = ./txt/cord-213136-euv6pqh5.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-252347-vnn4135b
     author = Lee, Wai-Ming
      title = A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants
       date = 2007-10-03
      pages = 
  extension = .txt
       mime = text/plain
      words = 5672
  sentences = 271
     flesch = 51
    summary = METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amount of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5'NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences.
      cache = ./cache/cord-252347-vnn4135b.txt
       txt  = ./txt/cord-252347-vnn4135b.txt
=== reduce.pl bib ===
         id = cord-264746-gfn312aa
     author = Muse, Spencer
      title = GENOMICS AND BIOINFORMATICS
       date = 2012-03-29
      pages = 
  extension = .txt
       mime = text/plain
      words = 10976
  sentences = 583
     flesch = 58
    summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research.
      cache = ./cache/cord-264746-gfn312aa.txt
       txt  = ./txt/cord-264746-gfn312aa.txt
=== reduce.pl bib ===
         id = cord-267500-x3u9i1vq
     author = Rose, Rebecca
      title = Challenges in the analysis of viral metagenomes
       date = 2016-08-03
      pages = 
  extension = .txt
       mime = text/plain
      words = 5928
  sentences = 308
     flesch = 40
    summary = Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of lowfrequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al.
      cache = ./cache/cord-267500-x3u9i1vq.txt
       txt  = ./txt/cord-267500-x3u9i1vq.txt
=== reduce.pl bib ===
         id = cord-311240-o0zyt2vb
     author = Motayo, Babatunde Olarenwaju
      title = Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences
       date = 2020-07-27
      pages = 
  extension = .txt
       mime = text/plain
      words = 3091
  sentences = 167
     flesch = 50
    summary = Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating, we advocate for up scaling NGS sequencing platforms across Africa to enhance surveillance and aid control effort of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020) . This study was designed to determine to the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against references whole genome sequences of SARS, Recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2 ).
      cache = ./cache/cord-311240-o0zyt2vb.txt
       txt  = ./txt/cord-311240-o0zyt2vb.txt
=== reduce.pl bib ===
         id = cord-321715-bkfkmtld
     author = Redelings, Benjamin D
      title = Incorporating indel information into phylogeny estimation for rapidly emerging pathogens
       date = 2007-03-14
      pages = 
  extension = .txt
       mime = text/plain
      words = 9793
  sentences = 546
     flesch = 54
    summary = To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t 1 , ..., t 2N -3 ) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation.
      cache = ./cache/cord-321715-bkfkmtld.txt
       txt  = ./txt/cord-321715-bkfkmtld.txt
=== reduce.pl bib ===
         id = cord-311839-61djk4bs
     author = Wei, Dan
      title = A novel hierarchical clustering algorithm for gene sequences
       date = 2012-07-23
      pages = 
  extension = .txt
       mime = text/plain
      words = 8033
  sentences = 496
     flesch = 61
    summary = We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1 .
      cache = ./cache/cord-311839-61djk4bs.txt
       txt  = ./txt/cord-311839-61djk4bs.txt
=== reduce.pl bib ===
         id = cord-018963-2lia97db
     author = Xu, Ying
      title = Protein Structure Prediction by Protein Threading
       date = 2010-04-29
      pages = 
  extension = .txt
       mime = text/plain
      words = 15309
  sentences = 716
     flesch = 48
    summary = Their follow-up work (Elofsson et aI., 1996; Fischer and Eisenberg, 1996; Fischer et aI., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et aI., 1992) on protein fold recognition led to the development of a new brand ofpowerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many ofthe proteins encoded in the hundreds of genomes that have been sequenced up to now.
      cache = ./cache/cord-018963-2lia97db.txt
       txt  = ./txt/cord-018963-2lia97db.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-102766-n6mpdhyu
     author = Alam, Md. Nafis Ul
      title = Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses
       date = 2020-06-25
      pages = 
  extension = .txt
       mime = text/plain
      words = 3193
  sentences = 192
     flesch = 56
    summary = title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence 28 scoring method that generates robust sequence information for training machine learning 29 classifiers. Here, we present a novel short k-mer based sequence 28 scoring method that generates robust sequence information for training machine learning 29 classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from 558 assembled metagenomic data.
      cache = ./cache/cord-102766-n6mpdhyu.txt
       txt  = ./txt/cord-102766-n6mpdhyu.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-321150-ev6acl7b
     author = Lam, Ha Minh
      title = Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
       date = 2017-10-03
      pages = 
  extension = .txt
       mime = text/plain
      words = 3184
  sentences = 161
     flesch = 50
    summary = Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values.
      cache = ./cache/cord-321150-ev6acl7b.txt
       txt  = ./txt/cord-321150-ev6acl7b.txt
=== reduce.pl bib ===
         id = cord-302798-q0mbngqy
     author = Ge, Junwei
      title = Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China
       date = 2018-06-14
      pages = 
  extension = .txt
       mime = text/plain
      words = 4343
  sentences = 273
     flesch = 58
    summary = In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27] .
      cache = ./cache/cord-302798-q0mbngqy.txt
       txt  = ./txt/cord-302798-q0mbngqy.txt
=== reduce.pl bib ===
         id = cord-266794-oyppubq5
     author = Zhang, Dachuan
      title = SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
       date = 2020-09-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 1003
  sentences = 75
     flesch = 48
    summary = title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020) , a genome browser (Ham, et al., 2012) , and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species.
      cache = ./cache/cord-266794-oyppubq5.txt
       txt  = ./txt/cord-266794-oyppubq5.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-280881-5o38ihe0
     author = Wlodawer, Alexander
      title = A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases
       date = 2003-11-11
      pages = 
  extension = .txt
       mime = text/plain
      words = 4862
  sentences = 220
     flesch = 51
    summary = These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8] . We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences ( Figure 1 ). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin ( Figure 4 ), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2.
      cache = ./cache/cord-280881-5o38ihe0.txt
       txt  = ./txt/cord-280881-5o38ihe0.txt
=== reduce.pl bib ===
         id = cord-274056-9t3kneoo
     author = Abd Elwahaab, Marwa A.
      title = A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
       date = 2019-05-08
      pages = 
  extension = .txt
       mime = text/plain
      words = 3314
  sentences = 251
     flesch = 59
    summary = title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1 . The similarity/dissimilarity vectors that are corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2 . The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4 . The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike ) is illustrated in Table 14 and shown in Figure 6 .
      cache = ./cache/cord-274056-9t3kneoo.txt
       txt  = ./txt/cord-274056-9t3kneoo.txt
=== reduce.pl bib ===
         id = cord-325985-xfzhn1n1
     author = Jabado, Omar J.
      title = Comprehensive viral oligonucleotide probe design using conserved protein regions
       date = 2007-12-13
      pages = 
  extension = .txt
       mime = text/plain
      words = 4260
  sentences = 227
     flesch = 47
    summary = The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions.
      cache = ./cache/cord-325985-xfzhn1n1.txt
       txt  = ./txt/cord-325985-xfzhn1n1.txt
=== reduce.pl bib ===
         id = cord-268467-btfz6ye8
     author = Schreiber, Steven S.
      title = Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E
       date = 1989-03-31
      pages = 
  extension = .txt
       mime = text/plain
      words = 5035
  sentences = 343
     flesch = 59
    summary = The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3'-end of the genomic RNA or the leader sequence. The 3'-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3'-end (Fig. 4) which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et a/., 1983; Lapps et al., 1987; Kamahora et a/., 1988; Boursnell et al., 1985) ( Table 1) . Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3'end of the viral mRNA leader sequence
      cache = ./cache/cord-268467-btfz6ye8.txt
       txt  = ./txt/cord-268467-btfz6ye8.txt
=== reduce.pl bib ===
         id = cord-301827-a7hnuxy5
     author = Uversky, Vladimir N
      title = A decade and a half of protein intrinsic disorder: Biology still waits for physics
       date = 2013-04-29
      pages = 
  extension = .txt
       mime = text/plain
      words = 20971
  sentences = 1059
     flesch = 43
    summary = 94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns.
      cache = ./cache/cord-301827-a7hnuxy5.txt
       txt  = ./txt/cord-301827-a7hnuxy5.txt
=== reduce.pl bib ===
         id = cord-300149-djclli8n
     author = Ruan, Yijun
      title = Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
       date = 2003-05-24
      pages = 
  extension = .txt
       mime = text/plain
      words = 4355
  sentences = 226
     flesch = 54
    summary = title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677.
      cache = ./cache/cord-300149-djclli8n.txt
       txt  = ./txt/cord-300149-djclli8n.txt
=== reduce.pl bib ===
         id = cord-279528-41atidai
     author = Abo-Elkhier, Mervat M.
      title = Measuring Similarity among Protein Sequences Using a New Descriptor
       date = 2019-11-22
      pages = 
  extension = .txt
       mime = text/plain
      words = 3045
  sentences = 217
     flesch = 57
    summary = Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A t , SA t ). e 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in e 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b) respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A t and standard deviation SA t of the combined intensity level value A t (i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
      cache = ./cache/cord-279528-41atidai.txt
       txt  = ./txt/cord-279528-41atidai.txt
=== reduce.pl bib ===
         id = cord-287658-c2lljdi7
     author = Lopez-Rincon, Alejandro
      title = Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning
       date = 2020-09-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 4766
  sentences = 253
     flesch = 55
    summary = The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e,g, Real-Time PCR amplicons to identify the SARS-CoV-2 16 . The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in 26 the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID ( Table 2) repository.
      cache = ./cache/cord-287658-c2lljdi7.txt
       txt  = ./txt/cord-287658-c2lljdi7.txt
=== reduce.pl bib ===
         id = cord-287634-64zqe4cz
     author = Al-Ssulami, Abdulrakeeb M.
      title = CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents
       date = 2020-01-31
      pages = 
  extension = .txt
       mime = text/plain
      words = 2307
  sentences = 137
     flesch = 59
    summary = For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents
      cache = ./cache/cord-287634-64zqe4cz.txt
       txt  = ./txt/cord-287634-64zqe4cz.txt
=== reduce.pl bib ===
         id = cord-304869-l6a68tqn
     author = Bielińska-Wąż, Dorota
      title = Graphical and numerical representations of DNA sequences: statistical aspects of similarity
       date = 2011-08-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 15408
  sentences = 940
     flesch = 60
    summary = As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences so it can be used for single nucleotide polymorfism (SNP) analyses which is subject of many investigation, as for example, in a recent work by Bhasi et al.
      cache = ./cache/cord-304869-l6a68tqn.txt
       txt  = ./txt/cord-304869-l6a68tqn.txt
=== reduce.pl bib ===
         id = cord-324216-ce3wa889
     author = Wang, Zheng
      title = Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses
       date = 2008-12-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 5206
  sentences = 240
     flesch = 49
    summary = Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17] . This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17] , developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets in conjunction with previously developed sequence analysis algorithm CIBSI can be easily interpreted to make serotype or strain identifications.
      cache = ./cache/cord-324216-ce3wa889.txt
       txt  = ./txt/cord-324216-ce3wa889.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-023209-un2ysc2v
     author = nan
      title = Poster Presentations
       date = 2008-10-07
      pages = 
  extension = .txt
       mime = text/plain
      words = 111878
  sentences = 5398
     flesch = 45
    summary = Site-specifi c PEGylation of human IgG1-Fab using a rationally designed trypsin variant In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifi cations of proteins with artifi cial functionalities under native state conditions. Recently, our group report a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes In the present study, peptide epitops of VP1 protein both 135-161(P1) amino acid residues (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and triptophan (Trp) containing on the N terminus 135-161 amino acid residues (Trp-135-161) (P2) were synthesized by using the microwave assisted solid-phase methods. Using as a template a peptide, already identifi ed, with agonist activity against PTPRJ(H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifi cations (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfi de bridge) and the preliminary biological results of this set of compounds.
      cache = ./cache/cord-023209-un2ysc2v.txt
       txt  = ./txt/cord-023209-un2ysc2v.txt
=== reduce.pl bib ===
         id = cord-004879-pgyzluwp
     author = nan
      title = Programmed cell death
       date = 1994
      pages = 
  extension = .txt
       mime = text/plain
      words = 81677
  sentences = 4465
     flesch = 51
    summary = Furthermore kinetic experiments after complementation of HIV=RT p66 with KIV-RT pSl indicated that HIV-RT pSl can restore rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer D66/D51, suggesting a function of the 51 kDa polypeptide, The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER), We could show that short-term activation (30 mins.) of c-FosER by estradiole (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins.
      cache = ./cache/cord-004879-pgyzluwp.txt
       txt  = ./txt/cord-004879-pgyzluwp.txt
=== reduce.pl bib ===
=== reduce.pl bib ===
=== reduce.pl bib ===
         id = cord-001835-0s7ok4uw
     author = nan
      title = Abstracts of the 29th Annual Symposium of The Protein Society
       date = 2015-10-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 138514
  sentences = 6150
     flesch = 40
    summary = Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural aminoacids via the MIO-enzyme toolkit Alina Filip 1 , Judith H Bartha-V ari 1 , Gergely B an oczy 2 , L aszl o Poppe 2 , Csaba Paizs 1 , Florin-Dan Irimie 1 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology An attractive enzymatic route to enantiomerically pure to the highly valuable a-or b-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues.
      cache = ./cache/cord-001835-0s7ok4uw.txt
       txt  = ./txt/cord-001835-0s7ok4uw.txt
=== reduce.pl bib ===
         id = cord-326225-crtpzad7
     author = Neill, John D.
      title = Simultaneous rapid sequencing of multiple RNA virus genomes
       date = 2014-06-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 3804
  sentences = 204
     flesch = 55
    summary = This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2 , library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled.
      cache = ./cache/cord-326225-crtpzad7.txt
       txt  = ./txt/cord-326225-crtpzad7.txt
=== reduce.pl bib ===
         id = cord-328644-odtue60a
     author = Comandatore, Francesco
      title = Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes
       date = 2020-05-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 6535
  sentences = 301
     flesch = 50
    summary = These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses -as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016) , which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April, 10, 2020 ( Figure 1 ).
      cache = ./cache/cord-328644-odtue60a.txt
       txt  = ./txt/cord-328644-odtue60a.txt
=== reduce.pl bib ===
         id = cord-334394-qgyzk7th
     author = Edgar, Robert C.
      title = Petabase-scale sequence alignment catalyses viral discovery
       date = 2020-08-10
      pages = 
  extension = .txt
       mime = text/plain
      words = 8134
  sentences = 423
     flesch = 51
    summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] .
      cache = ./cache/cord-334394-qgyzk7th.txt
       txt  = ./txt/cord-334394-qgyzk7th.txt
=== reduce.pl bib ===
         id = cord-331698-rwow1ydx
     author = Latorre-Pérez, Adriel
      title = A lab in the field: applications of real-time, in situ metagenomic sequencing
       date = 2020-08-20
      pages = 
  extension = .txt
       mime = text/plain
      words = 6732
  sentences = 335
     flesch = 36
    summary = This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the next future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25] , microbial communities inspection in the Arctic [26] , DNA sequencing on the International Space Station (ISS) [27] , and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29] . In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced.
      cache = ./cache/cord-331698-rwow1ydx.txt
       txt  = ./txt/cord-331698-rwow1ydx.txt
=== reduce.pl bib ===
         id = cord-339209-oe8onyr9
     author = Vasilakis, Nikos
      title = Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range
       date = 2014-05-20
      pages = 
  extension = .txt
       mime = text/plain
      words = 5817
  sentences = 272
     flesch = 46
    summary = The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5'-untranslated region (5'-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail ( Figure 2 ). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CL pro , RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-termimal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain ( Figure 4A, 4D) .
      cache = ./cache/cord-339209-oe8onyr9.txt
       txt  = ./txt/cord-339209-oe8onyr9.txt
=== reduce.pl bib ===
         id = cord-334127-wjf8t8vp
     author = Brister, J. Rodney
      title = NCBI Viral Genomes Resource
       date = 2015-01-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 3863
  sentences = 186
     flesch = 37
    summary = This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11) (12) (13) , sequence assembly and virus discovery (14) (15) (16) (17) , viral dynamics and evolution (18) (19) (20) and pathogen detection (14, (21) (22) (23) . The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated 'viral host' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51) .
      cache = ./cache/cord-334127-wjf8t8vp.txt
       txt  = ./txt/cord-334127-wjf8t8vp.txt
=== reduce.pl bib ===
         id = cord-348427-worgd0xu
     author = Hatcher, Eneida L.
      title = Virus Variation Resource – improved response to emergent viral outbreaks
       date = 2017-01-04
      pages = 
  extension = .txt
       mime = text/plain
      words = 5552
  sentences = 258
     flesch = 48
    summary = The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus ( Table 2 ). New processes have been added to parse source descriptor terms from Gen-Bank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein an-notation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. To resolve this issue, the Virus Variation database loading pipeline parses Gen-Bank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach.
      cache = ./cache/cord-348427-worgd0xu.txt
       txt  = ./txt/cord-348427-worgd0xu.txt
=== reduce.pl bib ===
         id = cord-340907-j9i1wlak
     author = Zarai, Yoram
      title = Evolutionary selection against short nucleotide sequences in viruses and their related hosts
       date = 2020-04-27
      pages = 
  extension = .txt
       mime = text/plain
      words = 8162
  sentences = 415
     flesch = 45
    summary = Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depicts the average number of under-represented sequences of size m ¼ 3, 4, and 5 nucleotides, identified in few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses.
      cache = ./cache/cord-340907-j9i1wlak.txt
       txt  = ./txt/cord-340907-j9i1wlak.txt
=== reduce.pl bib ===
         id = cord-341564-fvuwick5
     author = Qi, Zhao-Hui
      title = Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application
       date = 2018-06-12
      pages = 
  extension = .txt
       mime = text/plain
      words = 2647
  sentences = 178
     flesch = 54
    summary = From these, we can see that physicochemical properties are widely applied with graphical representation of protein sequences by these researchers and their results seem well. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
      cache = ./cache/cord-341564-fvuwick5.txt
       txt  = ./txt/cord-341564-fvuwick5.txt
=== reduce.pl bib ===
         id = cord-330067-ujhgb3b0
     author = Huang, Yi
      title = CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes
       date = 2007-10-02
      pages = 
  extension = .txt
       mime = text/plain
      words = 3007
  sentences = 168
     flesch = 55
    summary = To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as 'Corona_NS3b' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis.
      cache = ./cache/cord-330067-ujhgb3b0.txt
       txt  = ./txt/cord-330067-ujhgb3b0.txt
=== reduce.pl bib ===
         id = cord-345552-h6fwi0qn
     author = Li, Q.-G.
      title = Hydropathic characteristics of adenovirus hexons
       date = 1997-07-01
      pages = 
  extension = .txt
       mime = text/plain
      words = 3522
  sentences = 206
     flesch = 53
    summary = The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2) . DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein
      cache = ./cache/cord-345552-h6fwi0qn.txt
       txt  = ./txt/cord-345552-h6fwi0qn.txt
=== reduce.pl bib ===
         id = cord-328259-3g4klpyg
     author = Guajardo-Leiva, Sergio
      title = Metagenomic Insights into the Sewage RNA Virosphere of a Large City
       date = 2020-09-21
      pages = 
  extension = .txt
       mime = text/plain
      words = 7626
  sentences = 370
     flesch = 47
    summary = Despite the overrepresentation of dsRNA viruses, our results show that Santiago's sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B ) and Totiviriade family were also highly abundant in treated and untreated sewage samples from the EU [5, 7] . Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments.
      cache = ./cache/cord-328259-3g4klpyg.txt
       txt  = ./txt/cord-328259-3g4klpyg.txt
=== reduce.pl bib ===
         id = cord-330312-1pjolkql
     author = Liu, Y.-T.
      title = Infectious Disease Genomics
       date = 2017-01-20
      pages = 
  extension = .txt
       mime = text/plain
      words = 5168
  sentences = 327
     flesch = 45
    summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30e32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10 .1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum
      cache = ./cache/cord-330312-1pjolkql.txt
       txt  = ./txt/cord-330312-1pjolkql.txt
=== reduce.pl bib ===
         id = cord-338207-60vrlrim
     author = Lefkowitz, E.J.
      title = Virus Databases
       date = 2008-07-30
      pages = 
  extension = .txt
       mime = text/plain
      words = 7957
  sentences = 368
     flesch = 48
    summary = (Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes.
      cache = ./cache/cord-338207-60vrlrim.txt
       txt  = ./txt/cord-338207-60vrlrim.txt
=== reduce.pl bib ===
         id = cord-354465-5nqrrnqr
     author = Haslinger, Christian
      title = RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties
       date = 1999
      pages = 
  extension = .txt
       mime = text/plain
      words = 10341
  sentences = 756
     flesch = 67
    summary = Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992) .
      cache = ./cache/cord-354465-5nqrrnqr.txt
       txt  = ./txt/cord-354465-5nqrrnqr.txt
=== reduce.pl bib ===
         id = cord-342785-55r01n0x
     author = Lemmon, Gordon H
      title = Predicting the sensitivity and specificity of published real-time PCR assays
       date = 2008-09-25
      pages = 
  extension = .txt
       mime = text/plain
      words = 4317
  sentences = 239
     flesch = 52
    summary = METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens
      cache = ./cache/cord-342785-55r01n0x.txt
       txt  = ./txt/cord-342785-55r01n0x.txt
=== reduce.pl bib ===
         id = cord-344782-ond1ziu5
     author = Zhang, Jing
      title = Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi)
       date = 2018-10-24
      pages = 
  extension = .txt
       mime = text/plain
      words = 6003
  sentences = 280
     flesch = 49
    summary = Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists and samples collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue-two of the main affected organs.
      cache = ./cache/cord-344782-ond1ziu5.txt
       txt  = ./txt/cord-344782-ond1ziu5.txt
=== reduce.pl bib ===
         id = cord-339915-8j04y50s
     author = Deng, Wei
      title = DV-Curve Representation of Protein Sequences and Its Application
       date = 2014-05-08
      pages = 
  extension = .txt
       mime = text/plain
      words = 2946
  sentences = 176
     flesch = 49
    summary = Based on the detailed hydrophobic-hydrophilic(HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures the other is the phylogenetic analysis among coronaviruses based on their spike proteins. In this paper, we introduce DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation Similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation
      cache = ./cache/cord-339915-8j04y50s.txt
       txt  = ./txt/cord-339915-8j04y50s.txt
=== reduce.pl bib ===
         id = cord-355075-ieb35upi
     author = Papenfuss, Anthony T
      title = The immune gene repertoire of an important viral reservoir, the Australian black flying fox
       date = 2012-06-20
      pages = 
  extension = .txt
       mime = text/plain
      words = 8952
  sentences = 480
     flesch = 54
    summary = alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response.
      cache = ./cache/cord-355075-ieb35upi.txt
       txt  = ./txt/cord-355075-ieb35upi.txt
=== reduce.pl bib ===
         id = cord-353290-1wi1dhv6
     author = Kustin, Talia
      title = Biased mutation and selection in RNA viruses
       date = 2020-09-28
      pages = 
  extension = .txt
       mime = text/plain
      words = 7611
  sentences = 402
     flesch = 52
    summary = We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses.
      cache = ./cache/cord-353290-1wi1dhv6.txt
       txt  = ./txt/cord-353290-1wi1dhv6.txt
=== reduce.pl bib ===
         id = cord-343863-q1y8uscj
     author = Whitney, Joe
      title = Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
       date = 2005-02-08
      pages = 
  extension = .txt
       mime = text/plain
      words = 3463
  sentences = 179
     flesch = 61
    summary = ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components ( Figure 1 ): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; (4) and a back-end Java program which runs alignment programs and compiles results in the database.
      cache = ./cache/cord-343863-q1y8uscj.txt
       txt  = ./txt/cord-343863-q1y8uscj.txt
=== reduce.pl bib ===
         id = cord-341879-vubszdp2
     author = Li, Lucy M
      title = Genomic analysis of emerging pathogens: methods, application and future trends
       date = 2014-11-22
      pages = 
  extension = .txt
       mime = text/plain
      words = 5029
  sentences = 253
     flesch = 36
    summary = In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences.
      cache = ./cache/cord-341879-vubszdp2.txt
       txt  = ./txt/cord-341879-vubszdp2.txt
===== Reducing email addresses
cord-035033-osjy88rc
cord-265857-fs6dj3dp
cord-263987-ff6kor0c
cord-321386-u1imic5l
cord-267500-x3u9i1vq
cord-321150-ev6acl7b
cord-001835-0s7ok4uw
cord-348427-worgd0xu
Creating transaction
Updating adr table
===== Reducing keywords
cord-000257-ampip7od
cord-016293-pyb00pt5
cord-016798-tv2ntug6
cord-025610-7vouj8pp
cord-000473-jpow6iw1
cord-025948-6dsx7pey
cord-014674-ey29970v
cord-004862-yv76yvy5
cord-018459-isbc1r2o
cord-015850-ef6svn8f
cord-012975-u87ol3fs
cord-033010-o5kiadfm
cord-256608-ajzk86rq
cord-103029-nc5yf6x4
cord-010260-8lnpujip
cord-001340-kqcx7lrq
cord-010161-bcuec2fz
cord-002473-2kpxhzbe
cord-005060-n901y2d4
cord-017584-9rx4jlw8
cord-011565-8ncgldaq
cord-001537-i34vmfpp
cord-256278-jvfjf7aw
cord-103297-4stnx8dw
cord-000642-mkwpuav6
cord-255194-4i9fc0r7
cord-016594-lj0us1dq
cord-023647-dlqs8ay9
cord-264296-0x90yubt
cord-022348-w7z97wir
cord-264135-s2u76pvk
cord-266288-buc4dd5y
cord-203232-1nnqx1g9
cord-035033-osjy88rc
cord-266960-kyx6xhvj
cord-018133-2otxft31
cord-003316-r5te5xob
cord-001786-ybd8hi8y
cord-300796-rmjv56ia
cord-017932-vmtjc8ct
cord-265857-fs6dj3dp
cord-010499-yefxrj30
cord-263987-ff6kor0c
cord-010273-0c56x9f5
cord-022494-d66rz6dc
cord-193910-7p3f3znj
cord-017354-cndb031c
cord-253436-dz84icdc
cord-255371-o9oxchq6
cord-014462-11ggaqf1
cord-268549-2lg8i9r1
cord-014461-2ubh9u8r
cord-001974-wjf3c7a7
cord-306725-0vam15pt
cord-321386-u1imic5l
cord-023208-w99gc5nx
cord-275258-azpg5yrh
cord-027316-echxuw74
cord-264746-gfn312aa
cord-213136-euv6pqh5
cord-031957-df4luh5v
cord-321715-bkfkmtld
cord-193356-hqbstgg7
cord-252347-vnn4135b
cord-311240-o0zyt2vb
cord-018963-2lia97db
cord-267500-x3u9i1vq
cord-102766-n6mpdhyu
cord-311839-61djk4bs
cord-254942-g51mjj2b
cord-302798-q0mbngqy
cord-321150-ev6acl7b
cord-321762-7kiahjyy
cord-266794-oyppubq5
cord-300807-9u8idlon
cord-325985-xfzhn1n1
cord-280881-5o38ihe0
cord-274056-9t3kneoo
cord-301827-a7hnuxy5
cord-279528-41atidai
cord-287658-c2lljdi7
cord-300149-djclli8n
cord-268467-btfz6ye8
cord-287634-64zqe4cz
cord-310734-6v7oru2l
cord-304869-l6a68tqn
cord-324216-ce3wa889
cord-291156-zxg3dsm3
cord-296691-cg463fbn
cord-302161-ytr7ds8i
cord-023209-un2ysc2v
cord-325043-vqjhiv7p
cord-004879-pgyzluwp
cord-325750-x7jpsnxg
cord-001835-0s7ok4uw
cord-326225-crtpzad7
cord-328644-odtue60a
cord-324021-y1vr1db0
cord-334394-qgyzk7th
cord-331698-rwow1ydx
cord-338207-60vrlrim
cord-330067-ujhgb3b0
cord-334127-wjf8t8vp
cord-348427-worgd0xu
cord-339209-oe8onyr9
cord-345552-h6fwi0qn
cord-340907-j9i1wlak
cord-304607-td0776wj
cord-341564-fvuwick5
cord-328259-3g4klpyg
cord-354465-5nqrrnqr
cord-342785-55r01n0x
cord-330312-1pjolkql
cord-339915-8j04y50s
cord-355075-ieb35upi
cord-344782-ond1ziu5
cord-353290-1wi1dhv6
cord-343863-q1y8uscj
cord-341879-vubszdp2
Creating transaction
Updating wrd table
===== Reducing urls
cord-016798-tv2ntug6
cord-025948-6dsx7pey
cord-000473-jpow6iw1
cord-015850-ef6svn8f
cord-033010-o5kiadfm
cord-256608-ajzk86rq
cord-002473-2kpxhzbe
cord-011565-8ncgldaq
cord-001537-i34vmfpp
cord-256278-jvfjf7aw
cord-103297-4stnx8dw
cord-000642-mkwpuav6
cord-016594-lj0us1dq
cord-264135-s2u76pvk
cord-264296-0x90yubt
cord-266288-buc4dd5y
cord-018133-2otxft31
cord-003316-r5te5xob
cord-017932-vmtjc8ct
cord-022494-d66rz6dc
cord-255371-o9oxchq6
cord-001974-wjf3c7a7
cord-275258-azpg5yrh
cord-306725-0vam15pt
cord-193356-hqbstgg7
cord-264746-gfn312aa
cord-267500-x3u9i1vq
cord-311240-o0zyt2vb
cord-311839-61djk4bs
cord-018963-2lia97db
cord-102766-n6mpdhyu
cord-254942-g51mjj2b
cord-321150-ev6acl7b
cord-302798-q0mbngqy
cord-266794-oyppubq5
cord-280881-5o38ihe0
cord-274056-9t3kneoo
cord-325985-xfzhn1n1
cord-301827-a7hnuxy5
cord-300149-djclli8n
cord-296691-cg463fbn
cord-324216-ce3wa889
cord-302161-ytr7ds8i
cord-291156-zxg3dsm3
cord-310734-6v7oru2l
cord-304607-td0776wj
cord-325750-x7jpsnxg
cord-001835-0s7ok4uw
cord-326225-crtpzad7
cord-328644-odtue60a
cord-334394-qgyzk7th
cord-330067-ujhgb3b0
cord-339209-oe8onyr9
cord-334127-wjf8t8vp
cord-348427-worgd0xu
cord-354465-5nqrrnqr
cord-341564-fvuwick5
cord-328259-3g4klpyg
cord-342785-55r01n0x
cord-344782-ond1ziu5
cord-355075-ieb35upi
cord-353290-1wi1dhv6
Creating transaction
Updating url table
===== Reducing named entities
cord-000257-ampip7od
cord-016798-tv2ntug6
cord-000473-jpow6iw1
cord-016293-pyb00pt5
cord-025610-7vouj8pp
cord-014674-ey29970v
cord-025948-6dsx7pey
cord-004862-yv76yvy5
cord-018459-isbc1r2o
cord-015850-ef6svn8f
cord-012975-u87ol3fs
cord-033010-o5kiadfm
cord-256608-ajzk86rq
cord-103029-nc5yf6x4
cord-001340-kqcx7lrq
cord-002473-2kpxhzbe
cord-010260-8lnpujip
cord-010161-bcuec2fz
cord-017584-9rx4jlw8
cord-005060-n901y2d4
cord-001537-i34vmfpp
cord-011565-8ncgldaq
cord-103297-4stnx8dw
cord-256278-jvfjf7aw
cord-000642-mkwpuav6
cord-255194-4i9fc0r7
cord-023647-dlqs8ay9
cord-016594-lj0us1dq
cord-022348-w7z97wir
cord-264296-0x90yubt
cord-264135-s2u76pvk
cord-203232-1nnqx1g9
cord-266288-buc4dd5y
cord-035033-osjy88rc
cord-266960-kyx6xhvj
cord-001786-ybd8hi8y
cord-003316-r5te5xob
cord-018133-2otxft31
cord-017932-vmtjc8ct
cord-300796-rmjv56ia
cord-265857-fs6dj3dp
cord-010273-0c56x9f5
cord-010499-yefxrj30
cord-263987-ff6kor0c
cord-022494-d66rz6dc
cord-193910-7p3f3znj
cord-253436-dz84icdc
cord-255371-o9oxchq6
cord-017354-cndb031c
cord-014461-2ubh9u8r
cord-268549-2lg8i9r1
cord-275258-azpg5yrh
cord-001974-wjf3c7a7
cord-027316-echxuw74
cord-014462-11ggaqf1
cord-321386-u1imic5l
cord-306725-0vam15pt
cord-213136-euv6pqh5
cord-252347-vnn4135b
cord-264746-gfn312aa
cord-193356-hqbstgg7
cord-267500-x3u9i1vq
cord-311240-o0zyt2vb
cord-031957-df4luh5v
cord-321715-bkfkmtld
cord-311839-61djk4bs
cord-018963-2lia97db
cord-321762-7kiahjyy
cord-102766-n6mpdhyu
cord-254942-g51mjj2b
cord-321150-ev6acl7b
cord-302798-q0mbngqy
cord-266794-oyppubq5
cord-300807-9u8idlon
cord-023208-w99gc5nx
cord-280881-5o38ihe0
cord-274056-9t3kneoo
cord-325985-xfzhn1n1
cord-279528-41atidai
cord-300149-djclli8n
cord-268467-btfz6ye8
cord-287658-c2lljdi7
cord-301827-a7hnuxy5
cord-304869-l6a68tqn
cord-287634-64zqe4cz
cord-324216-ce3wa889
cord-296691-cg463fbn
cord-302161-ytr7ds8i
cord-291156-zxg3dsm3
cord-304607-td0776wj
cord-310734-6v7oru2l
cord-325043-vqjhiv7p
cord-325750-x7jpsnxg
cord-324021-y1vr1db0
cord-326225-crtpzad7
cord-328644-odtue60a
cord-334394-qgyzk7th
cord-331698-rwow1ydx
cord-338207-60vrlrim
cord-330067-ujhgb3b0
cord-341564-fvuwick5
cord-345552-h6fwi0qn
cord-334127-wjf8t8vp
cord-348427-worgd0xu
cord-340907-j9i1wlak
cord-339209-oe8onyr9
cord-342785-55r01n0x
cord-328259-3g4klpyg
cord-354465-5nqrrnqr
cord-344782-ond1ziu5
cord-330312-1pjolkql
cord-339915-8j04y50s
cord-355075-ieb35upi
cord-343863-q1y8uscj
cord-341879-vubszdp2
cord-353290-1wi1dhv6
cord-004879-pgyzluwp
cord-023209-un2ysc2v
cord-001835-0s7ok4uw
Creating transaction
Updating ent table
===== Reducing parts of speech
cord-000257-ampip7od
cord-025610-7vouj8pp
cord-000473-jpow6iw1
cord-014674-ey29970v
cord-016798-tv2ntug6
cord-018459-isbc1r2o
cord-004862-yv76yvy5
cord-012975-u87ol3fs
cord-025948-6dsx7pey
cord-256608-ajzk86rq
cord-015850-ef6svn8f
cord-001340-kqcx7lrq
cord-033010-o5kiadfm
cord-002473-2kpxhzbe
cord-103029-nc5yf6x4
cord-017584-9rx4jlw8
cord-010161-bcuec2fz
cord-005060-n901y2d4
cord-001537-i34vmfpp
cord-256278-jvfjf7aw
cord-255194-4i9fc0r7
cord-023647-dlqs8ay9
cord-000642-mkwpuav6
cord-016293-pyb00pt5
cord-264296-0x90yubt
cord-264135-s2u76pvk
cord-203232-1nnqx1g9
cord-001786-ybd8hi8y
cord-011565-8ncgldaq
cord-266288-buc4dd5y
cord-010260-8lnpujip
cord-022348-w7z97wir
cord-035033-osjy88rc
cord-016594-lj0us1dq
cord-018133-2otxft31
cord-265857-fs6dj3dp
cord-266960-kyx6xhvj
cord-300796-rmjv56ia
cord-003316-r5te5xob
cord-017932-vmtjc8ct
cord-010273-0c56x9f5
cord-010499-yefxrj30
cord-263987-ff6kor0c
cord-193910-7p3f3znj
cord-253436-dz84icdc
cord-255371-o9oxchq6
cord-014461-2ubh9u8r
cord-268549-2lg8i9r1
cord-275258-azpg5yrh
cord-022494-d66rz6dc
cord-001974-wjf3c7a7
cord-306725-0vam15pt
cord-321386-u1imic5l
cord-027316-echxuw74
cord-213136-euv6pqh5
cord-017354-cndb031c
cord-103297-4stnx8dw
cord-252347-vnn4135b
cord-267500-x3u9i1vq
cord-311240-o0zyt2vb
cord-102766-n6mpdhyu
cord-321150-ev6acl7b
cord-266794-oyppubq5
cord-300807-9u8idlon
cord-311839-61djk4bs
cord-264746-gfn312aa
cord-321715-bkfkmtld
cord-254942-g51mjj2b
cord-321762-7kiahjyy
cord-302798-q0mbngqy
cord-280881-5o38ihe0
cord-274056-9t3kneoo
cord-325985-xfzhn1n1
cord-031957-df4luh5v
cord-279528-41atidai
cord-300149-djclli8n
cord-268467-btfz6ye8
cord-287658-c2lljdi7
cord-287634-64zqe4cz
cord-324216-ce3wa889
cord-296691-cg463fbn
cord-291156-zxg3dsm3
cord-018963-2lia97db
cord-304607-td0776wj
cord-302161-ytr7ds8i
cord-325043-vqjhiv7p
cord-310734-6v7oru2l
cord-014462-11ggaqf1
cord-325750-x7jpsnxg
cord-328644-odtue60a
cord-304869-l6a68tqn
cord-301827-a7hnuxy5
cord-193356-hqbstgg7
cord-324021-y1vr1db0
cord-334394-qgyzk7th
cord-326225-crtpzad7
cord-331698-rwow1ydx
cord-338207-60vrlrim
cord-330067-ujhgb3b0
cord-334127-wjf8t8vp
cord-339209-oe8onyr9
cord-348427-worgd0xu
cord-345552-h6fwi0qn
cord-341564-fvuwick5
cord-340907-j9i1wlak
cord-330312-1pjolkql
cord-342785-55r01n0x
cord-328259-3g4klpyg
cord-339915-8j04y50s
cord-344782-ond1ziu5
cord-343863-q1y8uscj
cord-341879-vubszdp2
cord-355075-ieb35upi
cord-353290-1wi1dhv6
cord-354465-5nqrrnqr
cord-023208-w99gc5nx
cord-004879-pgyzluwp
cord-023209-un2ysc2v
cord-001835-0s7ok4uw
Creating transaction
Updating pos table
Building ./etc/reader.txt
cord-001835-0s7ok4uw
cord-301827-a7hnuxy5
cord-023209-un2ysc2v
cord-023209-un2ysc2v
cord-023208-w99gc5nx
cord-001835-0s7ok4uw
                number of items: 118
                   sum of words: 1,037,270
          average size in words: 9,973
      average readability score: 51

                          nouns: sequence; sequences; protein; proteins; virus; structure; data; analysis; genome; peptides; dna; gene; peptide; number; acid; cell; viruses; amino; cells; results; methods; activity; method; genes; model; alignment; structures; information; time; study; species; sequencing; studies; residues; region; acids; database; approach; function; type; genomes; domain; similarity; disease; samples; receptor; length; group; order; expression
                          verbs: used; shown; based; bind; found; contain; identifying; including; provided; known; obtaining; represented; determined; compare; suggests; given; generating; developed; indicate; increasing; following; performed; describe; see; predict; allowing; involved; make; revealed; leads; associated; form; studying; considered; observe; reported; detecting; result; produce; required; propose; related; expressed; induced; characterize; isolated; cause; investigated; defined; applied
                     adjectives: different; viral; new; human; high; specific; molecular; structural; many; large; similar; important; several; first; biological; single; multiple; novel; non; immune; small; functional; possible; available; nucleotide; various; present; low; genetic; genomic; phylogenetic; common; secondary; positive; active; major; complete; higher; like; particular; short; potential; unique; evolutionary; dependent; clinical; long; free; amino; natural
                        adverbs: also; however; well; highly; therefore; respectively; previously; even; recently; often; furthermore; first; now; currently; still; together; directly; far; rather; finally; much; significantly; specifically; moreover; closely; relatively; less; especially; generally; clearly; widely; usually; approximately; already; almost; yet; subsequently; randomly; hence; completely; fully; additionally; instead; interestingly; strongly; rapidly; potentially; particularly; typically; successfully
                       pronouns: we; it; their; its; our; they; i; them; us; his; one; he; itself; themselves; your; my; you; her; him; she; me; ourselves; yÞ; mine; l1oc; himself; s; ppifs; p53-mdm2; p450s; n40np; ifnyr-/-mice; https://github.com/ababaian/serratus; em; cb562; ³hser; yegfp; y_~; y401; y; w@; u; tlg1; sod-3::gfp; pgem2dhfr; p110a; ours; organotyp[c; n−3; nthash
                   proper nouns: RNA; C; Fig; SARS; PCR; Table; A; N; DNA; T; S; Genome; University; NMR; II; DeepRC; M; ±; NCBI; Protein; CoV-2; HCV; B; L; fl; HIV; K; E.; D; GenBank; Virus; India; Human; bp; LSTM; Institute; F; CNN; RT; China; MS; Gly; G; C.; novo; mRNA; Hopfield; L1; Analysis; CoV
                       keywords: sequence; rna; dna; protein; virus; genome; structure; sars; gene; model; pcr; viral; study; human; cell; acid; university; table; result; peptide; nmr; high; disease; bind; activity; sequencing; receptor; plant; ncbi; method; interaction; india; cnn; cmv; vaccine; tyr; site; residue; read; probe; pro; phylogenetic; orf; mutation; mil; mhc; metagenomic; lys; lstm; isolate

       one topic; one dimension: sequence
                        file(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945003/
                      titles(s): The Nature of Protein Domain Evolution: Shaping the Interaction Network

    three topics; one dimension: sequence; protein; virus
                        file(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7261164/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639731/
                      titles(s): To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics | Poster Presentations | Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh

  five topics; three dimensions: sequence sequences virus; protein proteins binding; peptide peptides activity; sequence sequences protein; structures secondary rna
                        file(s): https://doi.org/10.3390/v12040422, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087532/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123984/, https://www.ncbi.nlm.nih.gov/pubmed/17883226/
                      titles(s): A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses | Programmed cell death | Poster Presentations | Protein Structure Prediction by Protein Threading | RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties

      Type: cord
     title: keyword-sequence-cord
      date: 2021-05-25
      time: 16:43
  username: emorgan
    patron: Eric Morgan
     email: emorgan@nd.edu
     input: keywords:sequence
==== make-pages.sh htm files
==== make-pages.sh complex files
==== make-pages.sh named enities
==== making bibliographics
         id: cord-274056-9t3kneoo
     author: Abd Elwahaab, Marwa A.
      title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
       date: 2019-05-08
      words: 3314.0
  sentences: 251.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-274056-9t3kneoo.txt
        txt: ./txt/cord-274056-9t3kneoo.txt
    summary: title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector For beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as illustrated in Table 1 . The similarity/dissimilarity vectors that are corresponding to beta globin, ND5, and spike protein sequences are illustrated in Tables 9, 10, and 11, respectively, based on the two methods discussed before. The results in Table 10 show that both the magnitude ( 5 ) and the angle ( 5 ) can measure similarity/dissimilarity degree well among ND5 protein sequences as shown in Figure 2 . The similarity/dissimilarity analysis among the seven beta globin sequences measured according to ( 5 ) is illustrated in Table 12 and shown in Figure 4 . The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR spike ) is illustrated in Table 14 and shown in Figure 6 .
   abstract: Similarity/dissimilarity analysis is a key way of understanding the biology of an organism by knowing the origin of the new genes/sequences. Sequence data are grouped in terms of biological relationships. The number of sequences related to any group is susceptible to be increased every day. All the present alignment-free methods approve the utility of their approaches by producing a similarity/dissimilarity matrix. Although this matrix is clear, it measures the degree of similarity among sequences individually. In our work, a representative of each of three groups of protein sequences is introduced. A similarity/dissimilarity vector is evaluated instead of the ordinary similarity/dissimilarity matrix based on the group representative. The approach is applied on three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-grouping comparison is produced to ensure the singularity of each group. A qualitative comparison between our approach, previous articles, and the phylogenetic tree of these protein sequences proved the utility of our approach.
        url: https://doi.org/10.1155/2019/8702968
        doi: 10.1155/2019/8702968

         id: cord-279528-41atidai
     author: Abo-Elkhier, Mervat M.
      title: Measuring Similarity among Protein Sequences Using a New Descriptor
       date: 2019-11-22
      words: 3045.0
  sentences: 217.0
      pages: 
     flesch: 57.0
      cache: ./cache/cord-279528-41atidai.txt
        txt: ./txt/cord-279528-41atidai.txt
    summary: Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the total numbers of each protein sequence (A t , SA t ). e 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in e 2D graphical representation of TGEVG from class I and GD03T0013 from SARS_CoV protein sequences is illustrated in Figures 4(a) and 4(b) respectively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean A t and standard deviation SA t of the combined intensity level value A t (i) of the protein sequence. F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
   abstract: The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With the developments of sequencing technologies, a large number of protein sequences increase exponentially in the public databases. Famous sequences' comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related and they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation of protein sequence is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences within a little computation time. It is applied on nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with others' approaches, results, and sequence homology.
        url: https://www.ncbi.nlm.nih.gov/pubmed/31886192/
        doi: 10.1155/2019/2796971

         id: cord-287634-64zqe4cz
     author: Al-Ssulami, Abdulrakeeb M.
      title: CodSeqGen: A tool for generating synonymous coding sequences with desired GC-contents
       date: 2020-01-31
      words: 2307.0
  sentences: 137.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-287634-64zqe4cz.txt
        txt: ./txt/cord-287634-64zqe4cz.txt
    summary: For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. Thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and GC-content constraints. A more restricted method was presented recently, which the authors named NullSeq. NullSeq [10] uses the maximum entropy approach where the synonymous codon usage probability is derived from a strict function that expresses the expected GC-content in the reference amino acid sequence. We ran both tools, CodSeqGen and NullSeq [10] , to generate 1000 coding sequences given the primary amino acid sequence and the target GC-content of the reference coding sequence. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents
   abstract: Abstract Identification of regulatory elements is essential for understanding the mechanism behind regulating gene expression. These regulatory elements—located in or near gene—bind to proteins called transcription factors to initiate the transcription process. Their occurrences are influenced by the GC-content or nucleotide composition. For generating synthetic coding sequences with pre-specified amino acid sequence and desired GC-content, there exist two stochastic methods, multinomial and maximum entropy. Both methods rely on the probability of choosing the codon synonymous for usage in regard to a specific amino acid. In spite the latter exhibited unbiased manner, the produced sequences are not exactly obeying the GC-content constraint. In this paper, we present an algorithmic solution to produce coding sequences that follow exactly a primary amino acid sequence and a desired GC-content. The proposed tool, namely CodSeqGen, depends on random selection for smaller subsets to be traversed using the backtracking approach.
        url: https://doi.org/10.1016/j.ygeno.2019.02.002
        doi: 10.1016/j.ygeno.2019.02.002

         id: cord-102766-n6mpdhyu
     author: Alam, Md. Nafis Ul
      title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses
       date: 2020-06-25
      words: 3193.0
  sentences: 192.0
      pages: 
     flesch: 56.0
      cache: ./cache/cord-102766-n6mpdhyu.txt
        txt: ./txt/cord-102766-n6mpdhyu.txt
    summary: title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. Here, we present a novel short k-mer based sequence 28 scoring method that generates robust sequence information for training machine learning 29 classifiers. Here, we present a novel short k-mer based sequence 28 scoring method that generates robust sequence information for training machine learning 29 classifiers. VirFinder: a novel k-mer based tool for identifying viral sequences from 558 assembled metagenomic data.
   abstract: High throughout sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task but these models rely entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de-novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lays the groundwork for robust machine learning methods for the automated annotation of sequence data. Author Summary In this age of high-throughput sequencing, proper classification of copious amounts of sequence data remains to be a daunting challenge. Presently, sequence alignment methods are immediately assigned to the task. Owing to the selection forces of nature, there is considerable homology even between the sequences of different species which draws ambiguity to the results of alignment-based searches. Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. These features were able to accurately distinguish virus RNA from human transcripts with performance scores better than all previous reports. Our models were able to generalize well to distant species of viruses and mouse transcripts. The model correctly classifies the majority of false hits generated by current standard alignment tools. These findings strongly imply that this k-mer score based computational pipeline forges a highly informative, rich set of numerical machine learning features and similar pipelines can greatly advance the field of computational biology.
        url: https://doi.org/10.1101/2020.06.25.170779
        doi: 10.1101/2020.06.25.170779

         id: cord-018133-2otxft31
     author: Altman, Russ B.
      title: Bioinformatics
       date: 2006
      words: 9592.0
  sentences: 462.0
      pages: 
     flesch: 46.0
      cache: ./cache/cord-018133-2otxft31.txt
        txt: ./txt/cord-018133-2otxft31.txt
    summary: Experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. With the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. The Entrez system from the National Center for Biological Information (NCBI) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the Human Genome Project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantages of either explicit or computed links between these data resources.
   abstract: Why is sequence, structure, and biological pathway information relevant to medicine? Where on the Internet should you look for a DNA sequence, a protein sequence, or a protein structure? What are two problems encountered in analyzing biological sequence, structure, and function? How has the age of genomics changed the landscape of bioinformatics? What two changes should we anticipate in the medical record as a result of these new information sources? What are two computational challenges in bioinformatics for the future?
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122933/
        doi: 10.1007/0-387-36278-9_22

         id: cord-010260-8lnpujip
     author: Anthonsen, Henrik W.
      title: The blind watchmaker and rational protein engineering
       date: 1994-08-31
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: In the present review some scientific areas of key importance for protein engineering are discussed, such as problems involved in deducting protein sequence from DNA sequence (due to posttranscriptional editing, splicing and posttranslational modifications), modelling of protein structures by homology, NMR of large proteins (including probing the molecular surface with relaxation agents), simulation of protein structures by molecular dynamics and simulation of electrostatic effects in proteins (including pH-dependent effects). It is argued that all of these areas could be of key importance in most protein engineering projects, because they give access to increased and often unique information. In the last part of the review some potential areas for future applications of protein engineering approaches are discussed, such as non-conventional media, de novo design and nanotechnology.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173218/
        doi: 10.1016/0168-1656(94)90152-x

         id: cord-000473-jpow6iw1
     author: Astrovskaya, Irina
      title: Inferring viral quasispecies spectra from 454 pyrosequencing reads
       date: 2011-07-28
      words: 5363.0
  sentences: 296.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-000473-jpow6iw1.txt
        txt: ./txt/cord-000473-jpow6iw1.txt
    summary: High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population.
   abstract: BACKGROUND: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. RESULTS: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html. CONCLUSIONS: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194189/
        doi: 10.1186/1471-2105-12-s6-s1

         id: cord-035033-osjy88rc
     author: Aydin, Berkay
      title: Spatiotemporal event sequence discovery without thresholds
       date: 2020-11-09
      words: 8231.0
  sentences: 430.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-035033-osjy88rc.txt
        txt: ./txt/cord-035033-osjy88rc.txt
    summary: Here, we introduce a novel algorithm, RAND-ESMINER, which, by randomly repeating the mining process on a random subset of instances and follow relationships, finds an estimate participation index for event sequences. The RAND-ESMINER uses our pattern growth-based ESGROWTH algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. They defined a follow relation between the pointbased event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. In this paper, we will focus on mining STESs using a randomization approach, which will take a set of spatiotemporal event instances as input and returns all the discovered STESs together with a list of estimated participation index values for each STES, obtained from randomized trials.
   abstract: Spatiotemporal event sequences (STESs) are the ordered series of event types whose instances frequently follow each other in time and are located close-by. An STES is a spatiotemporal frequent pattern type, which is discovered from moving region objects whose polygon-based locations continiously evolve over time. Previous studies on STES mining require significance and prevalence thresholds for the discovery, which is usually unknown to domain experts. The quality of the discovered sequences is of great importance to the domain experts who use these algorithms. We introduce a novel algorithm to find the most relevant STESs without threshold values. We tested the relevance and performance of our threshold-free algorithm with a case study on solar event metadata, and compared the results with the previous STES mining algorithms.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7649715/
        doi: 10.1007/s10707-020-00427-6

         id: cord-000257-ampip7od
     author: Bagowski, Christoph P
      title: The Nature of Protein Domain Evolution: Shaping the Interaction Network
       date: 2010-08-17
      words: 4678.0
  sentences: 249.0
      pages: 
     flesch: 43.0
      cache: ./cache/cord-000257-ampip7od.txt
        txt: ./txt/cord-000257-ampip7od.txt
    summary: With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . This approach thus primarily focuses on the similarity and differences of the orthologous genes within network, and is therefore ideally suited for the study of protein domain evolution and has already revealed that species-specific parts Fig.
   abstract: The proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. These protein domains are essentially the main components of globular proteins and are the most principal level at which protein function and protein interactions can be understood. An important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. Changes in this information may bring about new folds, functions and protein architectures. With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. Such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving inter- and intra-molecular interactions. In turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. Additionally, these studies can be used for the design of new and optimized protein domains for therapy. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945003/
        doi: 10.2174/138920210791616725

         id: cord-003316-r5te5xob
     author: Balloux, Francois
      title: From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
       date: 2018-12-17
      words: 7340.0
  sentences: 327.0
      pages: 
     flesch: 34.0
      cache: ./cache/cord-003316-r5te5xob.txt
        txt: ./txt/cord-003316-r5te5xob.txt
    summary: WGS-based strain identification gives a far superior resolution In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
   abstract: Hospitals worldwide are facing an increasing incidence of hard-to-treat infections. Limiting infections and providing patients with optimal drug regimens require timely strain identification as well as virulence and drug-resistance profiling. Additionally, prophylactic interventions based on the identification of environmental sources of recurrent infections (e.g., contaminated sinks) and reconstruction of transmission chains (i.e., who infected whom) could help to reduce the incidence of nosocomial infections. WGS could hold the key to solving these issues. However, uptake in the clinic has been slow. Some major scientific and logistical challenges need to be solved before WGS fulfils its potential in clinical microbial diagnostics. In this review we identify major bottlenecks that need to be resolved for WGS to routinely inform clinical intervention and discuss possible solutions.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6249990/
        doi: 10.1016/j.tim.2018.08.004

         id: cord-291156-zxg3dsm3
     author: Bernasconi, Anna
      title: Empowering Virus Sequences Research through Conceptual Modeling
       date: 2020-05-01
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: The pandemic outbreak of the coronavirus disease has attracted attention towards the genetic mechanisms of viruses. We hereby present the Viral Conceptual Model (VCM), centered on the virus sequence and described from four perspectives: biological (virus type and hosts/sample), analytical (annotations and variants), organizational (sequencing project) and technical (experimental technology). VCM is inspired by GCM, our previously developed Genomic Conceptual Model, but it introduces many novel concepts, as viral sequences significantly differ from human genomes. When applied to SARS-CoV2 virus, complex conceptual queries upon VCM are able to replicate the search results of recent articles, hence demonstrating huge potential in supporting virology research. In addition to VCM, we also illustrate the data dictionary for patient’s phenotype used by the COVID-19 Host Genetic Initiative. Our effort is part of a broad vision: availability of conceptual models for both human genomics and viruses will provide important opportunities for research, especially if interconnected by the same human being, playing the role of virus host as well as provider of genomic and phenotype information.
        url: https://doi.org/10.1101/2020.04.29.067637
        doi: 10.1101/2020.04.29.067637

         id: cord-304869-l6a68tqn
     author: Bielińska-Wąż, Dorota
      title: Graphical and numerical representations of DNA sequences: statistical aspects of similarity
       date: 2011-08-28
      words: 15408.0
  sentences: 940.0
      pages: 
     flesch: 60.0
      cache: ./cache/cord-304869-l6a68tqn.txt
        txt: ./txt/cord-304869-l6a68tqn.txt
    summary: As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. How to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). It is shown in the last chapter of this work that by using the four-component spectral representation one can recognize the difference in one base between a pair of sequences so it can be used for single nucleotide polymorfism (SNP) analyses which is subject of many investigation, as for example, in a recent work by Bhasi et al.
   abstract: New approaches aiming at a detailed similarity/dissimilarity analysis of DNA sequences are formulated. Several corrections that enrich the information which may be derived from the alignment methods are proposed. The corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. The studies are supplemented by detailed similarity studies for histones H1 and H4 coding sequences. The data are described according to the latest version of the EMBL database. The work is supplemented by a concise review of the state-of-art graphical representations of DNA sequences.
        url: https://www.ncbi.nlm.nih.gov/pubmed/32214591/
        doi: 10.1007/s10910-011-9890-8

         id: cord-310734-6v7oru2l
     author: Bolatti, Elisa M.
      title: A Preliminary Study of the Virome of the South American Free-Tailed Bats (Tadarida brasiliensis) and Identification of Two Novel Mammalian Viruses
       date: 2020-04-09
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Bats provide important ecosystem services as pollinators, seed dispersers, and/or insect controllers, but they have also been found harboring different viruses with zoonotic potential. Virome studies in bats distributed in Asia, Africa, Europe, and North America have increased dramatically over the past decade, whereas information on viruses infecting South American species is scarce. We explored the virome of Tadarida brasiliensis, an insectivorous New World bat species inhabiting a maternity colony in Rosario (Argentina), by a metagenomic approach. The analysis of five pooled oral/anal swab samples indicated the presence of 43 different taxonomic viral families infecting a wide range of hosts. By conventional nucleic acid detection techniques and/or bioinformatics approaches, the genomes of two novel viruses were completely covered clustering into the Papillomaviridae (Tadarida brasiliensis papillomavirus type 1, TbraPV1) and Genomoviridae (Tadarida brasiliensis gemykibivirus 1, TbGkyV1) families. TbraPV1 is the first papillomavirus type identified in this host and the prototype of a novel genus. TbGkyV1 is the first genomovirus reported in New World bats and constitutes a new species within the genus Gemykibivirus. Our findings extend the knowledge about oral/anal viromes of a South American bat species and contribute to understand the evolution and genetic diversity of the novel characterized viruses.
        url: https://doi.org/10.3390/v12040422
        doi: 10.3390/v12040422

         id: cord-334127-wjf8t8vp
     author: Brister, J. Rodney
      title: NCBI Viral Genomes Resource
       date: 2015-01-28
      words: 3863.0
  sentences: 186.0
      pages: 
     flesch: 37.0
      cache: ./cache/cord-334127-wjf8t8vp.txt
        txt: ./txt/cord-334127-wjf8t8vp.txt
    summary: This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets. Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them are transfer annotation to related genomes (11) (12) (13) , sequence assembly and virus discovery (14) (15) (16) (17) , viral dynamics and evolution (18) (19) (20) and pathogen detection (14, (21) (22) (23) . The second model captures and standardizes host information for all viruses, and whenever a new RefSeq record is created, a manually curated ''viral host'' property is assigned to the relevant species within the NCBI Taxonomy database. The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and HIV-1, Human Interaction Database (50, 51) .
   abstract: Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.
        url: https://www.ncbi.nlm.nih.gov/pubmed/25428358/
        doi: 10.1093/nar/gku1207

         id: cord-203232-1nnqx1g9
     author: Canturk, Semih
      title: Machine-Learning Driven Drug Repurposing for COVID-19
       date: 2020-06-25
      words: 5023.0
  sentences: 257.0
      pages: 
     flesch: 52.0
      cache: ./cache/cord-203232-1nnqx1g9.txt
        txt: ./txt/cord-203232-1nnqx1g9.txt
    summary: Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. For Experiment II, we split the data on virus species, meaning the models were forced to predict drugs for a species that it was not trained on, and have to detect peptide substructures in the amino-acid sequences to suggest drugs. In post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence.
   abstract: The integration of machine learning methods into bioinformatics provides particular benefits in identifying how therapeutics effective in one context might have utility in an unknown clinical context or against a novel pathology. We aim to discover the underlying associations between viral proteins and antiviral therapeutics that are effective against them by employing neural network models. Using the National Center for Biotechnology Information virus protein database and the DrugVirus database, which provides a comprehensive report of broad-spectrum antiviral agents (BSAAs) and viruses they inhibit, we trained ANN models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. Model training excluded SARS-CoV-2 proteins and included only Phases II, III, IV and Approved level drugs. Using sequences for SARS-CoV-2 (the coronavirus that causes COVID-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating COVID-19. Our results suggest multiple drug candidates, some of which complement recent findings from noteworthy clinical studies. Our in-silico approach to drug repurposing has promise in identifying new drug candidates and treatments for other viruses.
        url: https://arxiv.org/pdf/2006.14707v1.pdf
        doi: nan

         id: cord-328644-odtue60a
     author: Comandatore, Francesco
      title: Insurgence and worldwide diffusion of genomic variants in SARS-CoV-2 genomes
       date: 2020-05-28
      words: 6535.0
  sentences: 301.0
      pages: 
     flesch: 50.0
      cache: ./cache/cord-328644-odtue60a.txt
        txt: ./txt/cord-328644-odtue60a.txt
    summary: These variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly in single stranded RNA viruses -as in the case of SARS-CoV-2 (Sanjuán and Domingo-Calap 2016) , which has a single, positive-strand RNA genome. To have a better insight on the history and spread of the COVID-19 pandemic in Italy and thanks to the sequences deposited in the Gisaid database, we identified 7 non synonymous mutations that are differentially frequent in Italian SARS-CoV-2 strains respect to strains circulating globally. Our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing Italian sequences with worldwide sequences available on Gisaid.org on April, 10, 2020 ( Figure 1 ).
   abstract: The SARS-CoV-2 pandemic that we are currently experiencing is exerting a massive toll both in human lives and economic impact. One of the challenges we must face is to try to understand if and how different variants of the virus emerge and change their frequency in time. Such information can be extremely valuable as it may indicate shifts in aggressiveness, and it could provide useful information to trace the spread of the virus in the population. In this work we identified and traced over time 7 amino acid variants that are present with high frequency in Italy and Europe, but that were absent or present at very low frequencies during the first stages of the epidemic in China and the initial reports in Europe. The analysis of these variants helps defining 6 phylogenetic clades that are currently spreading throughout the world with changes in frequency that are sometimes very fast and dramatic. In the absence of conclusive data at the time of writing, we discuss whether the spread of the variants may be due to a prominent founder effect or if it indicates an adaptive advantage.
        url: https://doi.org/10.1101/2020.04.30.071027
        doi: 10.1101/2020.04.30.071027

         id: cord-268549-2lg8i9r1
     author: Dai, Qi
      title: Sequence comparison via polar coordinates representation and curve tree
       date: 2012-01-07
      words: 4360.0
  sentences: 272.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-268549-2lg8i9r1.txt
        txt: ./txt/cord-268549-2lg8i9r1.txt
    summary: It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation
   abstract: Abstract Sequence comparison has become one of the essential bioinformatics tools in bioinformatics research, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but there are some design challenges with the graphical representations and feature-based measures. We reported here a new method for sequence comparison. It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. The curve tree was then constructed to numerically characterize the closed curve of biological sequences, and further compared biological sequences by evaluating the distance of the curve tree of the query sequence matching against a corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was further compared with alignment-based methods. The results demonstrate that using polar coordinates representation and curve tree to compare sequences is more efficient.
        url: https://doi.org/10.1016/j.jtbi.2011.09.030
        doi: 10.1016/j.jtbi.2011.09.030

         id: cord-002473-2kpxhzbe
     author: Das, Jayanta Kumar
      title: Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach
       date: 2017-03-31
      words: 4613.0
  sentences: 285.0
      pages: 
     flesch: 61.0
      cache: ./cache/cord-002473-2kpxhzbe.txt
        txt: ./txt/cord-002473-2kpxhzbe.txt
    summary: Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. The primary protein sequence is read as consecutive order pairs serially from first amino acid to the end of sequence, and each order pair is nothing but a connected edge between the two nodes where nodes in the graph are involved with different chemical groups of amino acids. Our method of phylogenetic tree formation used the dissimilarity matrix which is obtained for every pair of sequence on the basis of chemical group specific score of amino acids. Based on the phylogenetic tree of five members, we find that the PpcA and PpcD, PpcB and PpcE are mostly closed with regards to the frequency of amino acids of respective eight chemical groups.
   abstract: Periplasmic c7 type cytochrome A (PpcA) protein is determined in Geobacter sulfurreducens along with its other four homologs (PpcB-E). From the crystal structure viewpoint the observation emerges that PpcA protein can bind with Deoxycholate (DXCA), while its other homologs do not. But it is yet to be established with certainty the reason behind this from primary protein sequence information. This study is primarily based on primary protein sequence analysis through the chemical basis of embedded amino acids. Firstly, we look for the chemical group specific score of amino acids. Along with this, we have developed a new methodology for the phylogenetic analysis based on chemical group dissimilarities of amino acids. This new methodology is applied to the cytochrome c7 family members and pinpoint how a particular sequence is differing with others. Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. Thirdly, we search for unique patterns as subsequences which are common among the group or specific individual member. In all the cases, we are able to show some distinct features of PpcA that emerges PpcA as an outstanding protein compared to its other homologs, resulting towards its binding with deoxycholate. Similarly, some notable features for the structurally dissimilar protein PpcD compared to the other homologs are also brought out. Further, the five members of cytochrome family being homolog proteins, they must have some common significant features which are also enumerated in this study.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5376323/
        doi: 10.1371/journal.pone.0175031

         id: cord-004862-yv76yvy5
     author: Demers, G. William
      title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin
       date: 1989
      words: 6659.0
  sentences: 347.0
      pages: 
     flesch: 62.0
      cache: ./cache/cord-004862-yv76yvy5.txt
        txt: ./txt/cord-004862-yv76yvy5.txt
    summary: title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5'' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5'' and 3'' ends are not conserved between species.
   abstract: The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. L1Oc repeats have a common 3′ end that terminates in a poly A addition signal and an A-rich tract, but individual repeats have different 5′ ends, indicating a polar truncation from the 5′ end during their synthesis or propagation. As a result of the polar truncations, the 5′ end of L1Oc is present in about 11,000 copies per haploid genome, whereas the 3′ end is present in at least 66,000 copies per haploid genome. One type of L1Oc repeat has internal direct repeats of 78 bp in the 3′ untranslated region, whereas other L1Oc repeats have only one copy of this sequence. The longest repeat sequenced, L1Oc5, is 6.5 kb long, and genomic blot-hybridization data using probes from the 5′ end of L1Oc5 indicate that a full length L1Oc repeat is about 7.5 kb long, extending about 1 kb 5′ to the sequenced region. The L1Oc5 sequence has long open reading frames (ORFs) that correspond to ORF-1 and ORF-2 described in the mouse L1 sequence. In contrast to the overlapping reading frames seen for mouse L1, ORF-1 and ORF-2 are in the same reading frame in rabbit and human L1s, resulting in a discistronic structure. The region between the likely stop codon for ORF-1 and the proposed start codon for ORF-2 is not conserved in interspecies comparisons, which is further evidence that this short region does not encode part of a protein. ORF-1 appears to be a hybrid of sequences, of which the 3′ half is unique to and conserved in mammalian L1 repeats. The 5′ half of ORF-1 is not conserved between mammalian L1 repeats, but this segment of L1Oc is related significantly to type II cytoskeletal keratin.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087506/
        doi: 10.1007/bf02106177

         id: cord-339915-8j04y50s
     author: Deng, Wei
      title: DV-Curve Representation of Protein Sequences and Its Application
       date: 2014-05-08
      words: 2946.0
  sentences: 176.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-339915-8j04y50s.txt
        txt: ./txt/cord-339915-8j04y50s.txt
    summary: Based on the detailed hydrophobic-hydrophilic(HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures the other is the phylogenetic analysis among coronaviruses based on their spike proteins. In this paper, we introduce DV-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model of amino acids. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation Similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation
   abstract: Based on the detailed hydrophobic-hydrophilic(HP) model of amino acids, we propose dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. This graphical representation not only avoids degeneracy, but also has good visualization no matter how long these sequences are, and can reflect the length of protein sequence. Then we transform the 2D-graphical representation into a numerical characterization that can facilitate quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures the other is the phylogenetic analysis among coronaviruses based on their spike proteins.
        url: https://doi.org/10.1155/2014/203871
        doi: 10.1155/2014/203871

         id: cord-255194-4i9fc0r7
     author: Djikeng, Appolinaire
      title: Viral genome sequencing by random priming methods
       date: 2008-01-07
      words: 3776.0
  sentences: 207.0
      pages: 
     flesch: 51.0
      cache: ./cache/cord-255194-4i9fc0r7.txt
        txt: ./txt/cord-255194-4i9fc0r7.txt
    summary: An RNase treatment step was added to the SISPA protocol to reduce contaminating exogenous RNAs such as ribosomal RNAs. In the case of polyA-tailed viruses, we perform reverse transcription using a combination of random (FR26RV-N) and poly T tagged (FR40RV-T) primers in order to increase the coverage of the 3'' end ( Figure 2 ). Additionally, in order to capture 5'' ends of viral RNA, a random hexamer primer tagged with a conserved sequence at the 5'' end was added to the Klenow reaction (Figure 2 shows a 5'' oligo specific for rhinoviruses). The results of these experiments demonstrate that the SISPA method is very efficient as a genome sequencing method for samples with greater than 10 6 viral particles per RT-PCR reaction ( Figure 5 ). We strongly anticipate that specific adaptations of the SISPA method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives.
   abstract: BACKGROUND: Most emerging health threats are of zoonotic origin. For the overwhelming majority, their causative agents are RNA viruses which include but are not limited to HIV, Influenza, SARS, Ebola, Dengue, and Hantavirus. Of increasing importance therefore is a better understanding of global viral diversity to enable better surveillance and prediction of pandemic threats; this will require rapid and flexible methods for complete viral genome sequencing. RESULTS: We have adapted the SISPA methodology [1-3] to genome sequencing of RNA and DNA viruses. We have demonstrated the utility of the method on various types and sources of viruses, obtaining near complete genome sequence of viruses ranging in size from 3,000–15,000 kb with a median depth of coverage of 14.33. We used this technique to generate full viral genome sequence in the presence of host contaminants, using viral preparations from cell culture supernatant, allantoic fluid and fecal matter. CONCLUSION: The method described is of great utility in generating whole genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections.
        url: https://doi.org/10.1186/1471-2164-9-5
        doi: 10.1186/1471-2164-9-5

         id: cord-266288-buc4dd5y
     author: Dong, Rui
      title: A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance
       date: 2019-04-09
      words: 5247.0
  sentences: 300.0
      pages: 
     flesch: 61.0
      cache: ./cache/cord-266288-buc4dd5y.txt
        txt: ./txt/cord-266288-buc4dd5y.txt
    summary: Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). The natural vector method performs well on many datasets (Deng et al., 2011; Yu et al., 2013b; Hoang et al., 2016; Li et al., 2016) , however, it only considers the number, average position and dispersion of positions of each nucleotide. In this paper, we propose a new Accumulated Natural Vector (ANV) method, which not only considers the basic property of each nucleotide, but also the covariance between them. In this paper, we propose an Accumulated Natural Vector approach, which projects each sequence into a point in R 18 , where the additional six dimensions describe the covariance between nucleotides.
   abstract: Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. The alignment-free methods are popular nowadays, whereas the manual intervention in those methods usually decreases the accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ(18). By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only can capture the distribution of each nucleotide, but also provide the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ(18). The tests of ANV of datasets of different sizes and types have proved the accuracy and time-efficiency of the new proposed ANV method.
        url: https://www.ncbi.nlm.nih.gov/pubmed/31024610/
        doi: 10.3389/fgene.2019.00234

         id: cord-033010-o5kiadfm
     author: Durojaye, Olanrewaju Ayodeji
      title: Potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study
       date: 2020-10-02
      words: 8125.0
  sentences: 375.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-033010-o5kiadfm.txt
        txt: ./txt/cord-033010-o5kiadfm.txt
    summary: RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Our current study took advantage of the availability of the SARS CoV main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-nCoV. The predicted secondary structure composition shows a high degree of alpha helix and beta sheets, respectively, occupying 45 and 47% of the total residues with the percentage loop occupancy at 8% regarded as comparative modeling, constructs atomic models based on known structures or structures that have been determined experimentally and likewise share more than 40% sequence homology.
   abstract: BACKGROUND: The 2019-nCoV which is regarded as a novel coronavirus is a positive-sense single-stranded RNA virus. It is infectious to humans and is the cause of the ongoing coronavirus outbreak which has elicited an emergency in public health and a call for immediate international concern has been linked to it. The coronavirus main proteinase which is also known as the 3C-like protease (3CLpro) is a very important protein in all coronaviruses for the role it plays in the replication of the virus and the proteolytic processing of the viral polyproteins. The resultant cytotoxic effect which is a product of consistent viral replication and proteolytic processing of polyproteins can be greatly reduced through the inhibition of the viral main proteinase activities. This makes the 3C-like protease of the coronavirus a potential and promising target for therapeutic agents against the viral infection. RESULTS: This study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the viral full genome, translated and the resultant amino acid sequence used in modeling the protein 3D structure. Comparative physiochemical studies were carried out on the resultant target protein and its template while selected HIV protease inhibitors were docked against the protein binding sites which contained no co-crystallized ligand. CONCLUSION: In line with results from this study which has shown great consistency with other scientific findings on coronaviruses, we recommend the administration of the selected HIV protease inhibitors as first-line therapeutic agents for the treatment of the current coronavirus epidemic.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7529470/
        doi: 10.1186/s43042-020-00081-5

         id: cord-001786-ybd8hi8y
     author: Dutilh, Bas E
      title: Metagenomic ventures into outer sequence space
       date: 2014-12-15
      words: 2193.0
  sentences: 121.0
      pages: 
     flesch: 44.0
      cache: ./cache/cord-001786-ybd8hi8y.txt
        txt: ./txt/cord-001786-ybd8hi8y.txt
    summary: These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database.
   abstract: Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as “unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as “biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4588555/
        doi: 10.4161/21597081.2014.979664

         id: cord-334394-qgyzk7th
     author: Edgar, Robert C.
      title: Petabase-scale sequence alignment catalyses viral discovery
       date: 2020-08-10
      words: 8134.0
  sentences: 423.0
      pages: 
     flesch: 51.0
      cache: ./cache/cord-334394-qgyzk7th.txt
        txt: ./txt/cord-334394-qgyzk7th.txt
    summary: To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1 ). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5] .
   abstract: Public sequence data represents a major opportunity for viral discovery, but its exploration has been inhibited by a lack of efficient methods for searching this corpus, which is currently at the petabase scale and growing exponentially. To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To implement this strategy, we developed a cloud computing architecture, Serratus, tailored for ultra-high throughput sequence alignment at the petabase scale. From this search, we identified and assembled thousands of CoV and CoV-like genomes and genome fragments ranging from known strains to putatively novel genera. We generalise this strategy to other viral families, identifying several novel deltaviruses and huge bacteriophages. To catalyse a new era of viral discovery we made millions of viral alignments and family identifications freely available to the research community. Expanding the known diversity and zoonotic reservoirs of CoV and other emerging pathogens can accelerate vaccine and therapeutic developments for the current pandemic, and help us anticipate and mitigate future ones.
        url: https://doi.org/10.1101/2020.08.07.241729
        doi: 10.1101/2020.08.07.241729

         id: cord-011565-8ncgldaq
     author: Elworth, R A Leo
      title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
       date: 2020-06-04
      words: 12960.0
  sentences: 717.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-011565-8ncgldaq.txt
        txt: ./txt/cord-011565-8ncgldaq.txt
    summary: For instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as MinHash (6) and Locality Sensitive Hashing (LSH) (7) , Count-Min Sketch (CMS) (8) , HyperLogLog (9) and Bloom filters (10) . A more in depth discussion of many of these topics can also be found in (3, 4) includes a thorough review of compressed string indexes, LSH via sketches, CMS, Bloom filters, and minimizers (13) , with accompanying applications in genomics for each. With this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68) . One of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of localitysensitive hashing, or specifically MinHash and Minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses.
   abstract: As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7261164/
        doi: 10.1093/nar/gkaa265

         id: cord-256278-jvfjf7aw
     author: Feng, Jie
      title: New method for comparing DNA primary sequences based on a discrimination measure
       date: 2010-10-21
      words: 2864.0
  sentences: 186.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-256278-jvfjf7aw.txt
        txt: ./txt/cord-256278-jvfjf7aw.txt
    summary: title: New method for comparing DNA primary sequences based on a discrimination measure Three years after, Blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. In Table 2 , we present the similarity/dissimilarity matrix for the full DNA sequences of bÀglobin gene from 10 species listed in Table 1 by our new method. In Fig. 2, we show the phylogenetic tree of 10 bÀglobin gene sequences based on the distance matrix DM, using NJ method. In this paper, we propose a new method for the similarity analysis of DNA sequences. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of DNA sequences. Analysis of similarity/ dissimilarity of DNA sequences based on novel 2-D graphical representation A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words
   abstract: Abstract We introduce a new approach to compare DNA primary sequences. The core of our method is a new measure of pairwise distances among sequences. Using the primitive discrimination substrings of sequence S and Q, a discrimination measure DM(S, Q) is defined for the similarity analysis of them. The proposed method does not require multiple alignments and is fully automatic. To illustrate its utility, we construct phylogenetic trees on two independent data sets. The results indicate that the method is efficient and powerful.
        url: https://www.sciencedirect.com/science/article/pii/S0022519310003978
        doi: 10.1016/j.jtbi.2010.07.040

         id: cord-016594-lj0us1dq
     author: Flower, Darren R.
      title: Identification of Candidate Vaccine Antigens In Silico
       date: 2012-09-28
      words: 12570.0
  sentences: 653.0
      pages: 
     flesch: 37.0
      cache: ./cache/cord-016594-lj0us1dq.txt
        txt: ./txt/cord-016594-lj0us1dq.txt
    summary: In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. When looking at a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequence, progresses through an extensive computational stage, ultimately to deliver a shortlist of antigens which can be validated through subsequent laboratory examination. Conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins, analysis in a series of in-vitro and in-vivo assays, animal models and with the ultimate objective of isolating one or two proteins displaying protective immunity.
   abstract: The identification of immunogenic whole-protein antigens is fundamental to the successful discovery of candidate subunit vaccines and their rapid, effective, and efficient transformation into clinically useful, commercially successful vaccine formulations. In the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. Reference is also made to the recent emergence of various expert systems for protein antigen identification.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7120937/
        doi: 10.1007/978-1-4614-5070-2_3

         id: cord-001974-wjf3c7a7
     author: Friis-Nielsen, Jens
      title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
       date: 2016-02-19
      words: 5773.0
  sentences: 348.0
      pages: 
     flesch: 48.0
      cache: ./cache/cord-001974-wjf3c7a7.txt
        txt: ./txt/cord-001974-wjf3c7a7.txt
    summary: Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP.
   abstract: Virus discovery from high throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association to disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4776208/
        doi: 10.3390/v8020053

         id: cord-016798-tv2ntug6
     author: Gautam, Ablesh
      title: Bioinformatics Applications in Advancing Animal Virus Research
       date: 2019-06-06
      words: 6978.0
  sentences: 405.0
      pages: 
     flesch: 44.0
      cache: ./cache/cord-016798-tv2ntug6.txt
        txt: ./txt/cord-016798-tv2ntug6.txt
    summary: The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well virus-specific bioinformatics tools that the researches can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise of new CDSs that are different than previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
   abstract: Viruses serve as infectious agents for all living entities. There have been various research groups that focus on understanding the viruses in terms of their host-viral relationships, pathogenesis and immune evasion. However, with the current advances in the field of science, now the research field has widened up at the ‘omics’ level. Apparently, generation of viral sequence data has been increasing. There are numerous bioinformatics tools available that not only aid in analysing such sequence data but also aid in deducing useful information that can be exploited in developing preventive and therapeutic measures. This chapter elaborates on bioinformatics tools that are specifically designed for animal viruses as well as other generic tools that can be exploited to study animal viruses. The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. Various databases that organize information on animal and human viruses have also been described. The chapter will converse on overview of the current advances, online and downloadable tools and databases in the field of bioinformatics that will enable the researchers to study animal viruses at gene level.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121192/
        doi: 10.1007/978-981-13-9073-9_23

         id: cord-302798-q0mbngqy
     author: Ge, Junwei
      title: Genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern China
       date: 2018-06-14
      words: 4343.0
  sentences: 273.0
      pages: 
     flesch: 58.0
      cache: ./cache/cord-302798-q0mbngqy.txt
        txt: ./txt/cord-302798-q0mbngqy.txt
    summary: In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). In our study, sequence analysis confirmed that MiCV genomes displayed the characteristics of members of the genus Circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral DNA replication, and major ORFs and repeats [26, 27] .
   abstract: Mink circovirus (MiCV), a virus that was newly discovered in 2013, has been associated with enteric disease. However, its etiological role in acute gastroenteritis is unclear, and its genetic characteristics are poorly described. In this study, the role of circoviruses (CVs) in mink acute gastroenteritis was investigated, and the MiCV genome was molecularly characterized through sequence analysis. Detection results demonstrated that MiCV was the only pathogen found in this infection. MiCVs and previously characterized CVs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus DNA replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (Rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in Rep. Pairwise comparisons showed that the capsid proteins of MiCV shared the highest amino acid sequence identity with those of porcine CV (PCV) 2 (45.4%) and bat CV (BatCV) 1 (45.4%). The amino acid sequence identity levels of Rep shared by MiCV with BatCV 1 (79.7%) and dog CV (dogCV) (54.5%) were broadly similar to those with starling CV (51.1%) and PCVs (46.5%). Phylogenetic analysis indicated that MiCVs were more closely related to mammalian CVs, such as BatCV, PCV, and dogCV, than to other animal CVs. Among mammalian CVs, MiCV and BatCV 1 were the most closely related. This study could contribute to understanding the potential pathogenicity of MiCV and the evolutionary and pathogenic characteristics of mammalian CVs.
        url: https://www.ncbi.nlm.nih.gov/pubmed/29948383/
        doi: 10.1007/s00705-018-3908-5

         id: cord-017932-vmtjc8ct
     author: Georgiev, Vassil St.
      title: Genomic and Postgenomic Research
       date: 2009
      words: 8476.0
  sentences: 360.0
      pages: 
     flesch: 36.0
      cache: ./cache/cord-017932-vmtjc8ct.txt
        txt: ./txt/cord-017932-vmtjc8ct.txt
    summary: The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host''s immune response and an individual''s genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
   abstract: The word genomics was first coined by T. Roderick from the Jackson Laboratories in 1986 as the name for the new field of science focused on the analysis and comparison of complete genome sequences of organisms and related high-throughput technologies.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122628/
        doi: 10.1007/978-1-60327-297-1_25

         id: cord-325043-vqjhiv7p
     author: Gorbalenya, Alexander E.
      title: An NTP-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand RNA viral replication
       date: 1989
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: NTP-motif, a consensus sequence previously shown to be characteristic of numerous NTP-utilizing enzymes, was identified in nonstructural proteins of several groups of positive-strand RNA viruses. These groups include picorna-, alpha-, and coronaviruses infecting animals and como-, poty-, tobamo-, tricorna-, hordei-, and furoviruses of plants, totalling 21 viruses. It has been demonstrated that the viral NTP-motif-containing proteins constitute three distinct families, the sequences within each family being similar to each other at a statistically highly significant level. A lower, but still valid similarity has also been revealed between the families. An overall alignment has been generated, which includes several highly conserved sequence stretches. The two most prominent of the latter contain the socalled “A” and “B” sites of the NTP-motif, with four of the five invariant amino acid residues observed within these sequences. These observations, taken together with the results of comparative analysis of the positions occupied by respective proteins (domains) in viral multidomain proteins, suggest that all the NTP-motif-containing proteins of positive-strand RNA viruses are homologous, constituting a highly diverged monophyletic group. In this group the “A” and “B” sites of the NTP-motif are the most conserved sequences and, by inference, should play the principal role in the functioning of the proteins. A hypothesis is proposed that all these proteins posses NTP-binding capacity and possibly NTPase activity, performing some NTP-dependent function in viral RNA replication. The importance of phylogenetic analysis for the assessment of the significance of the occurrence of the NTP-motif (and of sequence motifs of this sort in general) in proteins is emphasized.
        url: https://www.ncbi.nlm.nih.gov/pubmed/2522556/
        doi: 10.1007/bf02102483

         id: cord-328259-3g4klpyg
     author: Guajardo-Leiva, Sergio
      title: Metagenomic Insights into the Sewage RNA Virosphere of a Large City
       date: 2020-09-21
      words: 7626.0
  sentences: 370.0
      pages: 
     flesch: 47.0
      cache: ./cache/cord-328259-3g4klpyg.txt
        txt: ./txt/cord-328259-3g4klpyg.txt
    summary: Despite the overrepresentation of dsRNA viruses, our results show that Santiago''s sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Viral sequences identified as Partitiviridae-like viruses included in the "unclassified RNA viruses ShiM-2016" category in the NCBI taxonomy (~25% abundance; Figure 2B ) and Totiviriade family were also highly abundant in treated and untreated sewage samples from the EU [5, 7] . Therefore, the abundance of these viruses in the Trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the NCBI database) and contribute to a better understanding of the bacteriophage biology related to RNA genomes. Taken together, our results show that metagenomic surveys of RNA viruses in sewage samples and the use of HMMs could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments.
   abstract: Sewage-associated viruses can cause several human and animal diseases, such as gastroenteritis, hepatitis, and respiratory infections. Therefore, their detection in wastewater can reflect current infections within the source population. To date, no viral study has been performed using the sewage of any large South American city. In this study, we used viral metagenomics to obtain a single sample snapshot of the RNA virosphere in the wastewater from Santiago de Chile, the seventh largest city in the Americas. Despite the overrepresentation of dsRNA viruses, our results show that Santiago’s sewage RNA virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). Interestingly, we discovered three novel genogroups within the Picobirnaviridae family that can fill major gaps in this taxa’s evolutionary history. We also demonstrated the dominance of emerging Rotavirus genotypes, such as G8 and G6, that have displaced other classical genotypes, which is consistent with recent clinical reports. This study supports the usefulness of sewage viral metagenomics for public health surveillance. Moreover, it demonstrates the need to monitor the viral component during the wastewater treatment and recycling process, where this virome can constitute a reservoir of human pathogens.
        url: https://doi.org/10.3390/v12091050
        doi: 10.3390/v12091050

         id: cord-354465-5nqrrnqr
     author: Haslinger, Christian
      title: RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties
       date: 1999
      words: 10341.0
  sentences: 756.0
      pages: 
     flesch: 67.0
      cache: ./cache/cord-354465-5nqrrnqr.txt
        txt: ./txt/cord-354465-5nqrrnqr.txt
    summary: Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. In case of one particular class of biopolymers, the ribonucleic acid (RNA) molecules, decoding of information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of Watson-Crick (and GU) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. On the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many RNA molecules (Westhof and Jaeger, 1992) .
   abstract: The secondary structures of nucleic acids form a particularly important class of contact structures. Many important RNA molecules, however, contain pseudo-knots, a structural feature that is excluded explicitly from the conventional definition of secondary structures. We propose here a generalization of secondary structures incorporating ‘non-nested’ pseudo-knots, which we call bi-secondary structures, and discuss measures for the complexity of more general contact structures based on their graph-theoretical properties. Bi-secondary structures are planar trivalent graphs that are characterized by special embedding properties. We derive exact upper bounds on their number (as a function of the chain length n) implying that there are fewer different structures than sequences. Computational results show that the number of bi-secondary structures grows approximately like 2.35(n). Numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of RNA do not change when pseudo-knots are introduced into the secondary structure picture. We find a large fraction of neutral mutations and, in particular, networks of sequences that fold into the same shape. These neutral networks percolate through the entire sequence space.
        url: https://www.ncbi.nlm.nih.gov/pubmed/17883226/
        doi: 10.1006/bulm.1998.0085

         id: cord-348427-worgd0xu
     author: Hatcher, Eneida L.
      title: Virus Variation Resource – improved response to emergent viral outbreaks
       date: 2017-01-04
      words: 5552.0
  sentences: 258.0
      pages: 
     flesch: 48.0
      cache: ./cache/cord-348427-worgd0xu.txt
        txt: ./txt/cord-348427-worgd0xu.txt
    summary: The resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: Ebolaviruses, MERS coronavirus, rotavirus, and Zika virus ( Table 2 ). New processes have been added to parse source descriptor terms from Gen-Bank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. The resource includes data processing pipelines that retrieve sequences from GenBank, provide standardized gene and protein an-notation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. To resolve this issue, the Virus Variation database loading pipeline parses Gen-Bank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach.
   abstract: The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource.
        url: https://doi.org/10.1093/nar/gkw1065
        doi: 10.1093/nar/gkw1065

         id: cord-263987-ff6kor0c
     author: Holmes, Ian H.
      title: Solving the master equation for Indels
       date: 2017-05-12
      words: 7131.0
  sentences: 357.0
      pages: 
     flesch: 44.0
      cache: ./cache/cord-263987-ff6kor0c.txt
        txt: ./txt/cord-263987-ff6kor0c.txt
    summary: BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
   abstract: BACKGROUND: Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT: This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS: While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
        url: https://www.ncbi.nlm.nih.gov/pubmed/28494756/
        doi: 10.1186/s12859-017-1665-1

         id: cord-330067-ujhgb3b0
     author: Huang, Yi
      title: CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes
       date: 2007-10-02
      words: 3007.0
  sentences: 168.0
      pages: 
     flesch: 55.0
      cache: ./cache/cord-330067-ujhgb3b0.txt
        txt: ./txt/cord-330067-ujhgb3b0.txt
    summary: To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. In CoVDB, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. When we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using BLAST, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these ORFs belonged to a protein family in Pfam originally assigned as ''Corona_NS3b'' (accession number PF03053). database, CoVDB, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis.
   abstract: The recent SARS epidemic has boosted interest in the discovery of novel human and animal coronaviruses. By July 2007, more than 3000 coronavirus sequence records, including 264 complete genomes, are available in GenBank. The number of coronavirus species with complete genomes available has increased from 9 in 2003 to 25 in 2007, of which six, including coronavirus HKU1, bat SARS coronavirus, group 1 bat coronavirus HKU2, groups 2c and 2d coronaviruses, were sequenced by our laboratory. To overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, CoVDB (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. CoVDB provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. Sequences can be directly downloaded from the website in FASTA format. CoVDB also provides detailed annotation of all coronavirus sequences using a standardized nomenclature system, and overcomes the problems of duplicated and identical sequences in other databases. For complete genomes, a single representative sequence for each species is available for comparative analysis such as phylogenetic studies. With the annotated sequences in CoVDB, more specific blast search results can be generated for efficient downstream analysis.
        url: https://www.ncbi.nlm.nih.gov/pubmed/17913743/
        doi: 10.1093/nar/gkm754

         id: cord-325985-xfzhn1n1
     author: Jabado, Omar J.
      title: Comprehensive viral oligonucleotide probe design using conserved protein regions
       date: 2007-12-13
      words: 4260.0
  sentences: 227.0
      pages: 
     flesch: 47.0
      cache: ./cache/cord-325985-xfzhn1n1.txt
        txt: ./txt/cord-325985-xfzhn1n1.txt
    summary: The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions.
   abstract: Oligonucleotide microarrays have been applied to microbial surveillance and discovery where highly multiplexed assays are required to address a wide range of genetic targets. Although printing density continues to increase, the design of comprehensive microbial probe sets remains a daunting challenge, particularly in virology where rapid sequence evolution and database expansion confound static solutions. Here, we present a strategy for probe design based on protein sequences that is responsive to the unique problems posed in virus detection and discovery. The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. In silico testing using an experimentally derived thermodynamic model indicated near complete coverage of the viral sequence database.
        url: https://www.ncbi.nlm.nih.gov/pubmed/18079152/
        doi: 10.1093/nar/gkm1106

         id: cord-017354-cndb031c
     author: Janies, D.
      title: Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases
       date: 2008
      words: 12429.0
  sentences: 648.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-017354-cndb031c.txt
        txt: ./txt/cord-017354-cndb031c.txt
    summary: The products of a phylogenetic analysis are a graphical tree of ancestor-descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. Given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor or descendent. Phylogenetic analysis of large genomic datasets can present several nested NPcomplete problems: multiple alignment, tree-search, and in some cases, gene order and complement differences among organisms. We provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understand complex patterns of transmission among animal and human hosts: Severe Acute Respiratory Syndrome (SARS) [KSI03] and influenza [WEB92] . Molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events the virus among hosts.
   abstract: Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe, and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects are to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results on two important infectious diseases, Severe Acute Respiratory Syndrome (SARS) and influenza. SARS and influenza have similarities and important differences both as biological and comparative genomic analysis problems. Influenza viruses (Orthymxyoviridae) are RNA based. Current evidence indicates that influenza viruses originate in aquatic birds from wild populations. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of (1) various transmission pathways such as the role of intermediate hosts like swine and domestic birds and (2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci has become an important data type. In the past year, full genomic data has become prominent. These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that lead to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002 and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks [JON03]. Thus comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor–descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. However, this synthesis comes at a price. The cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristics tree search strategies and parallel computing.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121896/
        doi: 10.1007/978-3-540-74331-6_2

         id: cord-017584-9rx4jlw8
     author: Kim, Kwangsoo
      title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data
       date: 2007
      words: 3665.0
  sentences: 216.0
      pages: 
     flesch: 57.0
      cache: ./cache/cord-017584-9rx4jlw8.txt
        txt: ./txt/cord-017584-9rx4jlw8.txt
    summary: Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Lost Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 with using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites.
   abstract: Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Lost Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. These results well illustrate the utility of the proposed method in genotyping applications.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7122177/
        doi: 10.1007/978-3-540-72665-4_8

         id: cord-324021-y1vr1db0
     author: Kozak, M.
      title: Determinants of translational fidelity and efficiency in vertebrate mRNAs
       date: 1994-12-31
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Abstract This article reviews current knowledge on the mechanisms affecting the fidelity of initiation codon selection, and discusses the effects of structural features in the 5′-non-coding region on the efficiency of translation of messenger RNA molecules.
        url: https://www.sciencedirect.com/science/article/pii/0300908494901821
        doi: 10.1016/0300-9084(94)90182-1

         id: cord-353290-1wi1dhv6
     author: Kustin, Talia
      title: Biased mutation and selection in RNA viruses
       date: 2020-09-28
      words: 7611.0
  sentences: 402.0
      pages: 
     flesch: 52.0
      cache: ./cache/cord-353290-1wi1dhv6.txt
        txt: ./txt/cord-353290-1wi1dhv6.txt
    summary: We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses. Nevertheless, RNA viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. Two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of A-richness that we observe: there is selection for more A in viral sequences, and/or there is a mutational bias that leads to more A in genomes of viruses.
   abstract: RNA viruses are responsible for some of the worst pandemics known to mankind, including outbreaks of Influenza, Ebola, and the recent COVID-19. One major challenge in tackling RNA viruses is the fact they are extremely genetically diverse. Nevertheless, they share common features that include their dependence on host cells for replication, and high mutation rates. We set out to search for shared evolutionary characteristics that may aid in gaining a broader understanding of RNA virus evolution, and constructed a phylogeny-based dataset spanning thousands of sequences from diverse single-stranded RNA viruses of animals. Strikingly, we found that the vast majority of these viruses have a skewed nucleotide composition, manifested as adenine rich (A-rich) coding sequences. In order to test whether A-richness is driven by selection or by biased mutation processes, we harnessed the effects of incomplete purifying selection at the tips of virus phylogenies. Our results revealed consistent mutational biases towards U rather than A in genomes of all viruses. In +ssRNA viruses we found that this bias is compensated by selection against U and selection for A, which leads to A-rich genomes. In -ssRNA viruses the genomic mutational bias towards U on the negative strand manifests as A-rich coding sequences, on the positive strand. We investigated possible reasons for the advantage of A-rich sequences including weakened RNA secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent RNA viruses.
        url: https://doi.org/10.1093/molbev/msaa247
        doi: 10.1093/molbev/msaa247

         id: cord-001340-kqcx7lrq
     author: Ladner, Jason T.
      title: Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
       date: 2014-06-17
      words: 2512.0
  sentences: 121.0
      pages: 
     flesch: 40.0
      cache: ./cache/cord-001340-kqcx7lrq.txt
        txt: ./txt/cord-001340-kqcx7lrq.txt
    summary: Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/ DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization.
   abstract: Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4068259/
        doi: 10.1128/mbio.01360-14

         id: cord-321150-ev6acl7b
     author: Lam, Ha Minh
      title: Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
       date: 2017-10-03
      words: 3184.0
  sentences: 161.0
      pages: 
     flesch: 50.0
      cache: ./cache/cord-321150-ev6acl7b.txt
        txt: ./txt/cord-321150-ev6acl7b.txt
    summary: Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. A strong descent or ascent in the middle of a HGRW indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. To illustrate improved runtimes and memory usage of the new 3SEQ algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, Ebola virus, the coronavirus responsible for Middle-East Respiratory Syndrome (MERS) and Zika virus; see table 1. The genomic alignments of MERS and Zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% triplets were able to be tested for mosaicism with exact P values.
   abstract: Identifying recombinant sequences in an era of large genomic databases is challenging as it requires an efficient algorithm to identify candidate recombinants and parents, as well as appropriate statistical methods to correct for the large number of comparisons performed. In 2007, a computation was introduced for an exact nonparametric mosaicism statistic that gave high-precision P values for putative recombinants. This exact computation meant that multiple-comparisons corrected P values also had high precision, which is crucial when performing millions or billions of tests in large databases. Here, we introduce an improvement to the algorithmic complexity of this computation from O(mn(3)) to O(mn(2)), where m and n are the numbers of recombination-informative sites in the candidate recombinant. This new computation allows for recombination analysis to be performed in alignments with thousands of polymorphic sites. Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed.
        url: https://doi.org/10.1093/molbev/msx263
        doi: 10.1093/molbev/msx263

         id: cord-025610-7vouj8pp
     author: Latif, Seemab
      title: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
       date: 2020-05-06
      words: 3923.0
  sentences: 230.0
      pages: 
     flesch: 50.0
      cache: ./cache/cord-025610-7vouj8pp.txt
        txt: ./txt/cord-025610-7vouj8pp.txt
    summary: In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Recently, Recurrent Neural Networks (RNNs) and their variants such as Long Short Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. first proposed multiple variants of Backward and Forward (B/F) language models based on GRUs for constrained sentence generation [13] . Therefore, we have proposed a neural probabilistic Backward-Forward architecture that can generate high quality sequences, with word embedding substitution method to satisfy multiple constraints. In this paper, we have proposed a novel method, dubbed Neural Probabilistic Backward-Forward language model and word embedding substitution method to address the issue of lexical constrained sequence generation.
   abstract: Advancements in Long Short Term Memory (LSTM) Networks have shown remarkable success in various Natural Language Generation (NLG) tasks. However, generating sequence from pre-specified lexical constraints is a new, challenging and less researched area in NLG. Lexical constraints take the form of words in the language model’s output to create fluent and meaningful sequences. Furthermore, most of the previous approaches cater this problem by allowing the inclusion of pre-specified lexical constraints during the decoding process, which increases the decoding complexity exponentially or linearly with the number of constraints. Moreover, some of the previous approaches can only deal with single constraint. Additionally, most of the previous approaches only deal with single constraints. In this paper, we propose a novel neural probabilistic architecture based on backward-forward language model and word embedding substitution method that can cater multiple lexical constraints for generating quality sequences. Experiments shows that our proposed architecture outperforms previous methods in terms of intrinsic evaluation.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7256622/
        doi: 10.1007/978-3-030-49186-4_4

         id: cord-331698-rwow1ydx
     author: Latorre-Pérez, Adriel
      title: A lab in the field: applications of real-time, in situ metagenomic sequencing
       date: 2020-08-20
      words: 6732.0
  sentences: 335.0
      pages: 
     flesch: 36.0
      cache: ./cache/cord-331698-rwow1ydx.txt
        txt: ./txt/cord-331698-rwow1ydx.txt
    summary: This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the next future of industry and clinical diagnosis. Therefore, the ultra-portability, affordability, and speed in data production make the MinION technology suitable for real-time sequencing in a variety of environments, such as Ebola surveillance in West Africa during the last outbreak [25] , microbial communities inspection in the Arctic [26] , DNA sequencing on the International Space Station (ISS) [27] , and even the recently emerging pandemic coronavirus SARS-CoV-2 [28, 29] . In fact, there are some critical points to be addressed before this technique could become a standard in the industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ DNA extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast and real-time pipelines should be created and tested; and (v) level of expertise for managing the data and the samples should be notably reduced.
   abstract: High-throughput metagenomic sequencing is considered one of the main technologies fostering the development of microbial ecology. Widely used second-generation sequencers have enabled the analysis of extremely diverse microbial communities, the discovery of novel gene functions, and the comprehension of the metabolic interconnections established among microbial consortia. However, the high cost of the sequencers and the complexity of library preparation and sequencing protocols still hamper the application of metagenomic sequencing in a vast range of real-life applications. In this context, the emergence of portable, third-generation sequencers is becoming a popular alternative for the rapid analysis of microbial communities in particular scenarios, due to their low cost, simplicity of operation, and rapid yield of results. This review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the next future of industry and clinical diagnosis.
        url: https://doi.org/10.1093/biomethods/bpaa016
        doi: 10.1093/biomethods/bpaa016

         id: cord-252347-vnn4135b
     author: Lee, Wai-Ming
      title: A Diverse Group of Previously Unrecognized Human Rhinoviruses Are Common Causes of Respiratory Illnesses in Infants
       date: 2007-10-03
      words: 5672.0
  sentences: 271.0
      pages: 
     flesch: 51.0
      cache: ./cache/cord-252347-vnn4135b.txt
        txt: ./txt/cord-252347-vnn4135b.txt
    summary: METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5'' noncoding region with homologous sequences of the 101 known serotypes. The degenerate primers EV292 and EV222 for PCR amplification of NIm-1A region were not sensitive enough for direct detection of small amount of HRV in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough PCR product for cloning and sequencing. This new assay had 3 key components: sensitive pan-HRV primers and semi-nested PCR to amplify P1-P2 region from cDNA prepared from original clinical specimens, a sequence database of 260-bp P1-P2 region of 5''NCR of all 101 HRV serotypes to serve as standard references for HRV identification, and phylogenetic tree reconstruction of the new P1-P2 sequences and the 101 homologous reference sequences.
   abstract: BACKGROUND: Human rhinoviruses (HRVs) are the most prevalent human pathogens, and consist of 101 serotypes that are classified into groups A and B according to sequence variations. HRV infections cause a wide spectrum of clinical outcomes ranging from asymptomatic infection to severe lower respiratory symptoms. Defining the role of specific strains in various HRV illnesses has been difficult because traditional serology, which requires viral culture and neutralization tests using 101 serotype-specific antisera, is insensitive and laborious. METHODS AND FINDINGS: To directly type HRVs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. Nasal samples from 26 infants were first tested with a multiplex PCR assay for respiratory viruses, and HRV was the most common virus found (108 of 181 samples). Typing was completed for 101 samples and 103 HRVs were identified. Surprisingly, 54 (52.4%) HRVs did not match any of the known serotypes and had 12–35% nucleotide divergence from the nearest reference HRVs. Of these novel viruses, 9 strains (17 HRVs) segregated from HRVA, HRVB and human enterovirus into a distinct genetic group (“C”). None of these new strains could be cultured in traditional cell lines. CONCLUSIONS: By molecular analysis, over 50% of HRV detected in sick infants were previously unrecognized strains, including 9 strains that may represent a new HRV group. These findings indicate that the number of HRV strains is considerably larger than the 101 serotypes identified with traditional diagnostic techniques, and provide evidence of a new HRV group.
        url: https://www.ncbi.nlm.nih.gov/pubmed/17912345/
        doi: 10.1371/journal.pone.0000966

         id: cord-338207-60vrlrim
     author: Lefkowitz, E.J.
      title: Virus Databases
       date: 2008-07-30
      words: 7957.0
  sentences: 368.0
      pages: 
     flesch: 48.0
      cache: ./cache/cord-338207-60vrlrim.txt
        txt: ./txt/cord-338207-60vrlrim.txt
    summary: (Each arrow points to the table containing the primary key.) Tables are color-coded according to the source of the information they contain: yellow, data obtained from the original GenBank sequence record and the ICTV Eighth Report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. While most of us store our BLAST search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes.
   abstract: As tools and technologies for the analysis of biological organisms (including viruses) have improved, the amount of raw data generated by these technologies has increased exponentially. Today's challenge, therefore, is to provide computational systems that support data storage, retrieval, display, and analysis in a manner that allows the average researcher to mine this information for knowledge pertinent to his or her work. Every article in this encyclopedia contains knowledge that has been derived in part from the analysis of such large data sets, which in turn are directly dependent on the databases that are used to organize this information. Fortunately, continual improvements in data-intensive biological technologies have been matched by the development of computational technologies, including those related to databases. This work forms the basis of many of the technologies that encompass the field of bioinformatics. This article provides an overview of database structure and how that structure supports the storage of biological information. The different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological, as well as virus-specific, information.
        url: https://api.elsevier.com/content/article/pii/B9780123744104007196
        doi: 10.1016/b978-012374410-4.00719-6

         id: cord-342785-55r01n0x
     author: Lemmon, Gordon H
      title: Predicting the sensitivity and specificity of published real-time PCR assays
       date: 2008-09-25
      words: 4317.0
  sentences: 239.0
      pages: 
     flesch: 52.0
      cache: ./cache/cord-342785-55r01n0x.txt
        txt: ./txt/cord-342785-55r01n0x.txt
    summary: METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. This analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. A freely available real time PCR analysis tool called TaqSim [4] was used to find public sequences that would match the primer/probe assay in question. However, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. Current real-time PCR assay design approaches produce signatures with sensitivities generally too low for clinical use. Fifty Seven TaqMan PCR primer/probe combinations we predict to have higher sensitivity/specificity than current published assays. Development of quantitative gene-specific real-time RT-PCR assays for the detection of measles virus in clinical specimens
   abstract: BACKGROUND: In recent years real-time PCR has become a leading technique for nucleic acid detection and quantification. These assays have the potential to greatly enhance efficiency in the clinical laboratory. Choice of primer and probe sequences is critical for accurate diagnosis in the clinic, yet current primer/probe signature design strategies are limited, and signature evaluation methods are lacking. METHODS: We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. We found real-time PCR signatures described in recent literature and used a BLAST search based approach to collect all hits to the primer-probe combinations that should be amplified by real-time PCR chemistry. We then compared our hits with the sequences in the NCBI taxonomy tree that the signature was designed to detect. RESULTS: We found that many published signatures have high specificity (almost no false positives) but low sensitivity (high false negative rate). Where high sensitivity is needed, we offer a revised methodology for signature design which may designate that multiple signatures are required to detect all sequenced strains. We use this methodology to produce new signatures that are predicted to have higher sensitivity and specificity. CONCLUSION: We show that current methods for real-time PCR assay design have unacceptably low sensitivities for most clinical applications. Additionally, as new sequence data becomes available, old assays must be reassessed and redesigned. A standard protocol for both generating and assessing the quality of these assays is therefore of great value. Real-time PCR has the capacity to greatly improve clinical diagnostics. The improved assay design and evaluation methods presented herein will expedite adoption of this technique in the clinical lab.
        url: https://www.ncbi.nlm.nih.gov/pubmed/18817537/
        doi: 10.1186/1476-0711-7-18

         id: cord-321386-u1imic5l
     author: Li, Chun
      title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
       date: 2018-02-17
      words: 5503.0
  sentences: 311.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-321386-u1imic5l.txt
        txt: ./txt/cord-321386-u1imic5l.txt
    summary: METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Also, we develop a SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. By combining these elements with the conventional amino acid composition (AAC), a dimensional feature vector can be constructed to numerically characterize a protein sequence: , By combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. Numerical characterization of protein sequences based on the generalized Chou''s pseudo amino acid composition
   abstract: AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experiment results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
        url: https://doi.org/10.2174/1386207321666180130100838
        doi: 10.2174/1386207321666180130100838

         id: cord-306725-0vam15pt
     author: Li, Hao
      title: First detection and genomic characteristics of bovine torovirus in dairy calves in China
       date: 2020-05-09
      words: 3015.0
  sentences: 156.0
      pages: 
     flesch: 58.0
      cache: ./cache/cord-306725-0vam15pt.txt
        txt: ./txt/cord-306725-0vam15pt.txt
    summary: Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. A phylogenetic analysis based on the complete amino acid sequence of the S protein showed that the BToVs could be separated into four groups (Fig. 2) , designated tentatively as group 1 to group 4. The bovine torovirus strains BToV/SC-1/China and BToV /SC-2/China investigated in this study are indicated by black triangles Fig. 2 Phylogenetic tree based on the deduced 1586-aa sequence of the complete S gene. Moreover, the two Chinese strains shared identical unique amino acid changes in the S and HE genes when compared to the other strains with sequences available in the GenBank database, indicating the unique evolution in Chinese BToV strains. Moreover, two complete BToV genome sequences were obtained from the clinical samples, and these two BToV isolates had unique amino acid changes in the S and HE proteins.
   abstract: Bovine torovirus (BToV) is a diarrhea-causing pathogen. In this study, 92 diarrheic fecal samples from five farms in four provinces in China were collected and tested for BToV using a RT-PCR assay, and 21.73% samples were found to be BToV positive. Moreover, two complete BToV genome sequences (MN073058 and MN073059) were obtained from the clinical samples, which were 28,297 and 28,301 nucleotides in length, respectively. Sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the S protein compared to the complete S sequences of BToV available in the GenBank database. In addition, seven consecutive amino acid mutations were found from aa 1,486 to 1,492 in the S protein of isolate MN073058. Moreover, the two isolates shared one identical amino acid mutation in the receptor binding sites of the HE protein. To the best of our knowledge, this is the first report on the epidemic and genomic characterization of BToV in China, which is helpful for further understanding the genetic evolution of BToV. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s00705-020-04657-9) contains supplementary material, which is available to authorized users.
        url: https://doi.org/10.1007/s00705-020-04657-9
        doi: 10.1007/s00705-020-04657-9

         id: cord-341879-vubszdp2
     author: Li, Lucy M
      title: Genomic analysis of emerging pathogens: methods, application and future trends
       date: 2014-11-22
      words: 5029.0
  sentences: 253.0
      pages: 
     flesch: 36.0
      cache: ./cache/cord-341879-vubszdp2.txt
        txt: ./txt/cord-341879-vubszdp2.txt
    summary: In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. In this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to EIDs. Over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. In recent cases of EIDs, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters. Just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (Box 1), the coalescent framework allows inference of population history from pathogen sequences.
   abstract: The number of emerging infectious diseases is increasing. Characterizing novel or re-emerging infections is aided by the availability of pathogen genomes. In this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases.
        url: https://www.ncbi.nlm.nih.gov/pubmed/25418281/
        doi: 10.1186/s13059-014-0541-9

         id: cord-345552-h6fwi0qn
     author: Li, Q.-G.
      title: Hydropathic characteristics of adenovirus hexons
       date: 1997-07-01
      words: 3522.0
  sentences: 206.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-345552-h6fwi0qn.txt
        txt: ./txt/cord-345552-h6fwi0qn.txt
    summary: The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. The sequence of the predicted protein, consisting of 937 amino acids, was obtained with the LaserGene software program EditSeq. The hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of Kyte-Doolittle in the LaserGene computer program Protean. The nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera B, D and E to be closely related (Table 3 and Fig. 2) . DNA sequence of the adenovirus type 41 hexon gene and predicted structure of the protein
   abstract: The complete nucleotide sequence and the predicted amino acid sequence of the adenovirus type 7 hexon gene were determined. The hydro-pathy of the hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 was analysed. The presence of purines and pyrimid-ines in the second position of the codons was correlated to hydrophilicity and hydrophobicity, respectively. Comparison of the hydrophilicity plots of eight hexons showed seven hypervariable regions to be distributed on the surface. A large portion of the hypervariable regions manifests hydrophilicity. The strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. Analysis of codon usage for adenovirus hexons showed that among synony-mous codons those with cytidine in the third position were preferably used to a great extent. Analysis of the nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed members of subgenera B, D and E to be closely related, especially Ad4 and Ad16, and subgenus A to be closely related to subgenus F.
        url: https://www.ncbi.nlm.nih.gov/pubmed/9267445/
        doi: 10.1007/s007050050162

         id: cord-001537-i34vmfpp
     author: Lima, Francisco Esmaile de Sales
      title: Genomic Characterization of Novel Circular ssDNA Viruses from Insectivorous Bats in Southern Brazil
       date: 2015-02-17
      words: 3874.0
  sentences: 195.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-001537-i34vmfpp.txt
        txt: ./txt/cord-001537-i34vmfpp.txt
    summary: The predicted protein sequences encoded by ORF2 (cap) and ORF1 (rep) of BatCV I-VI genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; Pepper golden mosaic virus was used as outgroup, as they are somewhat related to other members in the Circoviridae family (Fig. 3A, 3B and 3C ). The phylogenetic analysis constructed based on the alignments of the complete REP and CAP protein confirms that BatCV POA/II and VI cluster into the genus Cyclovirus along with the Chinese cycloviruses sequences clade detected in bat feces [18] and sharing less than 65% of identity at the CAP/REP amino acid level. BatCV POA I and V had a low amino acid identity with CAP (<20%) and REP (<10%) sequences of two other sequences detected in bat feces in this study with known circoviruses/cycloviruses (Table 2) .
   abstract: Circoviruses are highly prevalent porcine and avian pathogens. In recent years, novel circular ssDNA genomes have recently been detected in a variety of fecal and environmental samples using deep sequencing approaches. In this study the identification of genomes of novel circoviruses and cycloviruses in feces of insectivorous bats is reported. Pan-reactive primers were used targeting the conserved rep region of circoviruses and cycloviruses to screen DNA bat fecal samples. Using this approach, partial rep sequences were detected which formed five phylogenetic groups distributed among the Circovirus and the recently proposed Cyclovirus genera of the Circoviridae. Further analysis using inverse PCR and Sanger sequencing led to the characterization of four new putative members of the family Circoviridae with genome size ranging from 1,608 to 1,790 nt, two inversely arranged ORFs, and canonical nonamer sequences atop a stem loop.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331541/
        doi: 10.1371/journal.pone.0118070

         id: cord-330312-1pjolkql
     author: Liu, Y.-T.
      title: Infectious Disease Genomics
       date: 2017-01-20
      words: 5168.0
  sentences: 327.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-330312-1pjolkql.txt
        txt: ./txt/cord-330312-1pjolkql.txt
    summary: One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30e32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10 .1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum
   abstract: The history and development of infectious disease genomics have been closely associated with the Human Genome Project (HGP) during the past 20 years. It has been emphasized since the beginning of the HGP that such effort must not be restricted to the human genome and should include other organisms including mouse, bacteria, yeast, fruit fly, and worm for comparative sequence analyses. A brief history is reviewed in this chapter. As of 2016, more than 7000 completed genome sequencing projects have been reported. One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. A number of examples are discussed in this chapter.
        url: https://www.sciencedirect.com/science/article/pii/B978012799942500010X
        doi: 10.1016/b978-0-12-799942-5.00010-x

         id: cord-265857-fs6dj3dp
     author: Liu, Yu-Tsueng
      title: Infectious Disease Genomics
       date: 2010-12-24
      words: 4341.0
  sentences: 233.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-265857-fs6dj3dp.txt
        txt: ./txt/cord-265857-fs6dj3dp.txt
    summary: The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002) . Genome sequencing projects for other important human disease vectors are in progress Megy et al., 2009 ). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10 .1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
   abstract: The history and development of infectious disease genomics are discussed in this chapter. HGP must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The polysaccharide capsule is important for meningococci to escape from complement-mediated killing. With the completion of the genome sequence of a virulent MenB strain, a “reverse vaccinology” approach was applied for the development of a universal MenB vaccine by Novartis. The indispensable fatty acid synthase (FAS) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. Through a systematic screening of 250,000 natural product extracts, a Merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from Streptomyces platensis. Vector Biology Network was formed to achieve three goals (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year 2000; (2) to engineer a mosquito incapable of carrying the malaria parasite by 2005; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by 2010. The most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis.
        url: https://www.sciencedirect.com/science/article/pii/B9780123848901000108
        doi: 10.1016/b978-0-12-384890-1.00010-8

         id: cord-287658-c2lljdi7
     author: Lopez-Rincon, Alejandro
      title: Classification and Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning
       date: 2020-09-10
      words: 4766.0
  sentences: 253.0
      pages: 
     flesch: 55.0
      cache: ./cache/cord-287658-c2lljdi7.txt
        txt: ./txt/cord-287658-c2lljdi7.txt
    summary: The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. For example, we can use this sequencing data with cDNA, resulting from the PCR of the original viral RNA; e,g, Real-Time PCR amplicons to identify the SARS-CoV-2 16 . The global impact of SARS-CoV-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: For example, in 26 the authors propose the use of Machine Learning Digital Signal Processing for separating the virus from similar strains, with remarkable accuracy. We calculated the frequency of appearance of different primer sets'' sequences used in SARS-CoV-2 RT-PCR tests developed by WHO referral laboratories and compared it to our primer design in the dataset from the GISAID ( Table 2) repository.
   abstract: In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from available repositories, separating the genome of different virus strains from the Coronavirus family with considerable accuracy. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are first validated on samples from other repositories, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets on existing datasets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensibility similar to routine diagnostic methods, and 100% specificity. In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from NGDC, separating the genome of different virus strains from the Coronavirus family with accuracy 98.73%. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from NCBI and GISAID, and proven able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensibility similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics.
        url: https://doi.org/10.1101/2020.03.13.990242
        doi: 10.1101/2020.03.13.990242

         id: cord-302161-ytr7ds8i
     author: Lutz, Mirjam
      title: FCoV Viral Sequences of Systemically Infected Healthy Cats Lack Gene Mutations Previously Linked to the Development of FIP
       date: 2020-07-24
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Feline Infectious Peritonitis (FIP)—the deadliest infectious disease of young cats in shelters or catteries—is induced by highly virulent feline coronaviruses (FCoVs) emerging in infected hosts after mutations of less virulent FCoVs. Previous studies have shown that some mutations in the open reading frames (ORF) 3c and 7b and the spike (S) gene have implications for the development of FIP, but mainly indirectly, likely also due to their association with systemic spread. The aim of the present study was to determine whether FCoV detected in organs of experimentally FCoV infected healthy cats carry some of these mutations. Viral RNA isolated from different tissues of seven asymptomatic cats infected with the field strains FCoV Zu1 or FCoV Zu3 was sequenced. Deletions in the 3c gene and mutations in the 7b and S genes that have been shown to have implications for the development of FIP were not detected, suggesting that these are not essential for systemic viral dissemination. However, deletions and single nucleotide polymorphisms leading to truncations were detected in all nonstructural proteins. These were found across all analyzed ORFs, but with significantly higher frequency in ORF 7b than ORF 3a. Additionally, a previously unknown homologous recombination site was detected in FCoV Zu1.
        url: https://doi.org/10.3390/pathogens9080603
        doi: 10.3390/pathogens9080603

         id: cord-025948-6dsx7pey
     author: Maitra, Arindam
      title: Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
       date: 2020-06-04
      words: 7218.0
  sentences: 382.0
      pages: 
     flesch: 56.0
      cache: ./cache/cord-025948-6dsx7pey.txt
        txt: ./txt/cord-025948-6dsx7pey.txt
    summary: Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. We have initiated a study on sequencing of SARS-CoV-2 genome from swab samples obtained from infected individuals from different regions of West Bengal in Eastern India and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. The A2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of P323L in the RdRp which is involved in replication of the viral genome and the change of D614G in the Spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ACE2 receptor. We have also detected emergence of mutations in the important regions of the viral genome including Spike, RdRP and nucleocapsid coding genes.
   abstract: Direct massively parallel sequencing of SARS-CoV-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in Eastern India. Seven of the isolates belonged to the A2a clade, while one belonged to the B4 clade. Specific mutations, characteristic of the A2a clade, were also detected, which included the P323L in RNA-dependent RNA polymerase and D614G in the Spike glycoprotein. Further, our data revealed emergence of novel subclones harbouring nonsynonymous mutations, viz. G1124V in Spike (S) protein, R203K, and G204R in the nucleocapsid (N) protein. The N protein mutations reside in the SR-rich region involved in viral capsid formation and the S protein mutation is in the S(2) domain, which is involved in triggering viral fusion with the host cell membrane. Interesting correlation was observed between these mutations and travel or contact history of COVID-19 positive cases. Consequent alterations of miRNA binding and structure were also predicted for these mutations. More importantly, the possible implications of mutation D614G (in S(D) domain) and G1124V (in S(2) subunit) on the structural stability of S protein have also been discussed. Results report for the first time a bird’s eye view on the accumulation of mutations in SARS-CoV-2 genome in Eastern India. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s12038-020-00046-1) contains supplementary material, which is available to authorized users.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7269891/
        doi: 10.1007/s12038-020-00046-1

         id: cord-010161-bcuec2fz
     author: Matson, David O.
      title: IV, 6. Calicivirus RNA recombination
       date: 2004-09-14
      words: 3335.0
  sentences: 168.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-010161-bcuec2fz.txt
        txt: ./txt/cord-010161-bcuec2fz.txt
    summary: With the description of statistically significant phylogenetic clades within CV genera, data were available to recognize strains that might be natural recombinants within CVs. Two examples are the well-characterized Argentine strain 320 (Arg320) and Snow Mountain virus (SMV), one of the prototype CVs, recognized to be recombinants when the RNA polymerase and capsid regions of these strains were characterized (Hardy et al., 1997; Jiang et al., 1999) (Fig. 2) . While SMV was likely also to be a recombinant virus, the capsid and RNA polymerase region amplicons of SMV were generated separately and that fact did not exclude the possibility of different sources of strains. Infection of single cells simultaneously by two CVs implies absence of immune or molecular and of 40 nt near the 5'' end of that strain''s capsid gene (ID="B" sequence for this Fig.) . The sequence data indicated that recombination in strain Arg320 occurred at the ORF1/capsid gene junction where high sequence identity exists between the putative parent clades.
   abstract: RNA recombination apparently contributed to the evolution of CVs. Nucleic acid sequence homology or identity and similar RNA secondary structure of CVs and non-CVs may provide a locus for recombination within CVs or with non-CVs should co-infections of the same cell occur. Natural recombinants have been demonstrated among other enteric viruses, including Picornaviridae (Kirkegaard and Baltimore, 1986; Furione et al., 1993), Astroviridae (Walter et al., 2001), and possibly rotaviruses (e.g., Desselberger, 1996; Suzuki et al., 1998), augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. Such is the case for CVs and Astroviridae, whose recombinant strains may be a common portion of naturally circulating strains. The taxonomic — and perhaps biologic — limits of recombination are defined by the suggested recombination of Nanovirus and CV, viruses from hosts of different biologic orders; the relationship of picornaviruses and CVs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of NLVs.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7172178/
        doi: 10.1016/s0168-7069(03)09032-3

         id: cord-275258-azpg5yrh
     author: Mead, Dylan J.T.
      title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling
       date: 2019-07-26
      words: 6333.0
  sentences: 346.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-275258-azpg5yrh.txt
        txt: ./txt/cord-275258-azpg5yrh.txt
    summary: title: Visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. We then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target RdRPs without solved structures for homology modelling. The solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly Table 5 Homology modelling at intra-order, inter-family level.
   abstract: The protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. Comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. Homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. Reconstructed ancestral RdRP sequences for individual genera were also used as templates for the production of ancestral RdRP homology models. High quality ancestral RdRP models were consistently produced, as were good quality models for target-template pairs in the same genus. Homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. We present a protocol for the production of optimal RdRP homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds. (219 words)
        url: https://www.sciencedirect.com/science/article/pii/S109332631930333X
        doi: 10.1016/j.jmgm.2019.07.014

         id: cord-027316-echxuw74
     author: Modarresi, Kourosh
      title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model
       date: 2020-05-22
      words: 2116.0
  sentences: 148.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-027316-echxuw74.txt
        txt: ./txt/cord-027316-echxuw74.txt
    summary: This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights. Deep Learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94] . Though, modified version of RNN like LSTM and GRU have been improvement over RNN (recurrent neural networks) in dealing with vanishing gradients and long-term memory loss, still they suffer from many deficiencies. Given the complexity of these dependencies, a neural network model is used to compute these weights. The embedding regularization is, α Embedding Error 2 (6) Input to any model has to be a number and hence the raw input of words or text sequence needs to be transformed to continuous numbers. Learning phrase representations using RNN encoder-decoder for statistical machine translation
   abstract: Every individual text or document is generated for specific purpose(s). Sometime, the text is deployed to convey a specific message about an event or a product. Other occasions, it may be communicating a scientific breakthrough, development or new model and so on. Given any specific objective, the creators and the users of documents may like to know which part(s) of the documents are more influential in conveying their specific messages or achieving their objectives. Understanding which parts of a document has more impact on the viewer’s perception would allow the content creators to design more effective content. Detecting the more impactful parts of a content would help content users, such as advertisers, to concentrate their efforts more on those parts of the content and thus to avoid spending resources on the rest of the document. This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture based on attention-based decoder with regularization applied to the corresponding weights.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7304011/
        doi: 10.1007/978-3-030-50420-5_20

         id: cord-325750-x7jpsnxg
     author: Mokili, John L
      title: Metagenomics and future perspectives in virus discovery
       date: 2012-01-20
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Monitoring the emergence and re-emergence of viral diseases with the goal of containing the spread of viral agents requires both adequate preparedness and quick response. Identifying the causative agent of a new epidemic is one of the most important steps for effective response to disease outbreaks. Traditionally, virus discovery required propagation of the virus in cell culture, a proven technique responsible for the identification of the vast majority of viruses known to date. However, many viruses cannot be easily propagated in cell culture, thus limiting our knowledge of viruses. Viral metagenomic analyses of environmental samples suggest that the field of virology has explored less than 1% of the extant viral diversity. In the last decade, the culture-independent and sequence-independent metagenomic approach has permitted the discovery of many viruses in a wide range of samples. Phylogenetically, some of these viruses are distantly related to previously discovered viruses. In addition, 60–99% of the sequences generated in different viral metagenomic studies are not homologous to known viruses. In this review, we discuss the advances in the area of viral metagenomics during the last decade and their relevance to virus discovery, clinical microbiology and public health. We discuss the potential of metagenomics for characterization of the normal viral population in a healthy community and identification of viruses that could pose a threat to humans through zoonosis. In addition, we propose a new model of the Koch's postulates named the ‘Metagenomic Koch's Postulates’. Unlike the original Koch's postulates and the Molecular Koch's postulates as formulated by Falkow, the metagenomic Koch's postulates focus on the identification of metagenomic traits in disease cases. The metagenomic traits that can be traced after healthy individuals have been exposed to the source of the suspected pathogen.
        url: https://doi.org/10.1016/j.coviro.2011.12.004
        doi: 10.1016/j.coviro.2011.12.004

         id: cord-000642-mkwpuav6
     author: Moreira, Rebeca
      title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing
       date: 2012-04-19
      words: 6848.0
  sentences: 372.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-000642-mkwpuav6.txt
        txt: ./txt/cord-000642-mkwpuav6.txt
    summary: title: Transcriptomics of In Vitro Immune-Stimulated Hemocytes from the Manila Clam Ruditapes philippinarum Using High-Throughput Sequencing The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. Moreover, a few transcripts encoded by genes putatively involved in the clam immune response against Perkinsus olseni have been reported by cDNA library sequencing [18] . philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (Crassostrea gigas of the family Ostreidae, Bathymodiolus azoricus and Mytilus galloprovincialis of the family Mytilidae and Laternula elliptica of the family Laternulidae).
   abstract: BACKGROUND: The Manila clam (Ruditapes philippinarum) is a worldwide cultured bivalve species with important commercial value. Diseases affecting this species can result in large economic losses. Because knowledge of the molecular mechanisms of the immune response in bivalves, especially clams, is scarce and fragmentary, we sequenced RNA from immune-stimulated R. philippinarum hemocytes by 454-pyrosequencing to identify genes involved in their immune defense against infectious diseases. METHODOLOGY AND PRINCIPAL FINDINGS: High-throughput deep sequencing of R. philippinarum using 454 pyrosequencing technology yielded 974,976 high-quality reads with an average read length of 250 bp. The reads were assembled into 51,265 contigs and the 44.7% of the translated nucleotide sequences into protein were annotated successfully. The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. We have found sequences from molecules never described in bivalves before, especially in the complement pathway where almost all the components are present. CONCLUSIONS: This study represents the first transcriptome analysis using 454-pyrosequencing conducted on R. philippinarum focused on its immune system. Our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and the study of gene expression as well as for the identification of genetic markers. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334963/
        doi: 10.1371/journal.pone.0035009

         id: cord-311240-o0zyt2vb
     author: Motayo, Babatunde Olarenwaju
      title: Evolution and Genetic Diversity of SARSCoV-2 in Africa Using Whole Genome Sequences
       date: 2020-07-27
      words: 3091.0
  sentences: 167.0
      pages: 
     flesch: 50.0
      cache: ./cache/cord-311240-o0zyt2vb.txt
        txt: ./txt/cord-311240-o0zyt2vb.txt
    summary: Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating, we advocate for up scaling NGS sequencing platforms across Africa to enhance surveillance and aid control effort of SARSCoV-2 in Africa. The pathogen was later identified to be a novel coronavirus closely related to the severe acute respiratory syndrome virus (SARS), with a possible bat origin (Zhou et al, 2020) . This study was designed to determine to the genetic diversity and evolutionary history of genome sequences of SARSCoV-2 isolated in Africa. Results of recombination analysis of the African SARSCoV-2 (AfrSARSCoV-2) sequences against references whole genome sequences of SARS, Recombination signals were observed between the African SARSCoV-2 sequences and reference sequence (Major recombinant hCoV-19 Pangolin/Guangu P4L/2017; Minor parent hCoV-19 B batYunan/RaTG13) between the RdRP and S gene regions (Figure 2 ).
   abstract: The ongoing SARSCoV-2 pandemic was introduced into Africa on 14th February 2020 and has rapidly spread across the continent causing severe public health crisis and mortality. We investigated the genetic diversity and evolution of this virus during the early outbreak months using whole genome sequences. We performed; recombination analysis against closely related CoV, Bayesian time scaled phylogeny and investigated spike protein amino acid mutations. Results from our analysis showed recombination signals between the AfrSARSCoV-2 sequences and reference sequences within the N and S genes. The evolutionary rate of the AfrSARSCoV-2 was 4.133 × 10−4 high posterior density HPD (4.132 × 10−4 to 4.134 × 10−4) substitutions/site/year. The time to most recent common ancestor TMRCA of the African strains was December 7th 2019. The AfrSARCoV-2 sequences diversified into two lineages A and B with B being more diverse with multiple sub-lineages confirmed by both maximum clade credibility MCC tree and PANGOLIN software. There was a high prevalence of the D614-G spike protein amino acid mutation (82.61%) among the African strains. Our study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating, we advocate for up scaling NGS sequencing platforms across Africa to enhance surveillance and aid control effort of SARSCoV-2 in Africa.
        url: https://doi.org/10.1101/2020.07.27.222901
        doi: 10.1101/2020.07.27.222901

         id: cord-018459-isbc1r2o
     author: Munjal, Geetika
      title: Phylogenetics Algorithms and Applications
       date: 2018-12-10
      words: 1851.0
  sentences: 122.0
      pages: 
     flesch: 42.0
      cache: ./cache/cord-018459-isbc1r2o.txt
        txt: ./txt/cord-018459-isbc1r2o.txt
    summary: This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . Constructing phylogenetic trees using multiple sequence alignment
   abstract: Phylogenetics is a powerful approach in finding evolution of current day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among species. The phylogenetic study can help in analysing the evolution and the similarities among diseases and viruses, and further help in prescribing their vaccines against them. This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. The paper has also discussed the application of phylogenetic study in disease diagnosis and evolution.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123334/
        doi: 10.1007/978-981-13-5934-7_17

         id: cord-264746-gfn312aa
     author: Muse, Spencer
      title: GENOMICS AND BIOINFORMATICS
       date: 2012-03-29
      words: 10976.0
  sentences: 583.0
      pages: 
     flesch: 58.0
      cache: ./cache/cord-264746-gfn312aa.txt
        txt: ./txt/cord-264746-gfn312aa.txt
    summary: The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today''s environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism''s genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research.
   abstract: This chapter discusses the basic principles of molecular biology regarding genome science and describes the major types of data involved in genome projects, including technologies for collecting them. Genome science is heavily driven by new technological advances that allow for rapid and inexpensive collection of various types of data. The emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. It has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. The next generation of biologists needs to be as comfortable at a computer workstation as they are at the lab bench. Recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science.The chapter discusses practical applications and uses of genomic data. For example, in the foreseeable future, are gene therapies that can repair genetic defects.
        url: https://api.elsevier.com/content/article/pii/B978012238662650015X
        doi: 10.1016/b978-0-12-238662-6.50015-x

         id: cord-321762-7kiahjyy
     author: Nandy, Ashesh
      title: Chapter 5 The GRANCH Techniques for Analysis of DNA, RNA and Protein Sequences
       date: 2015-12-31
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Abstract: The very rapid growth in molecular sequence data from the daily accretion of large gene and protein sequencing projects have led to issues regarding viewing and analyzing the massive amounts of data. Graphical representation and numerical characterization of DNA, RNA and protein sequences have exhibited great potential to address these concerns. We review here in brief several different formulations of these representations and examples of applications to diverse problems based on what this author had presented at the Second Mathematical Chemistry Workshop of the Americas in Bogota, Colombia in 2010. In particular, we note several insights that were gained from such representations, and the applications to the bio-medicinal field.
        url: https://api.elsevier.com/content/article/pii/B9781681080536500053
        doi: 10.1016/b978-1-68108-053-6.50005-3

         id: cord-326225-crtpzad7
     author: Neill, John D.
      title: Simultaneous rapid sequencing of multiple RNA virus genomes
       date: 2014-06-01
      words: 3804.0
  sentences: 204.0
      pages: 
     flesch: 55.0
      cache: ./cache/cord-326225-crtpzad7.txt
        txt: ./txt/cord-326225-crtpzad7.txt
    summary: This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation each viral library following pooling and sequencing. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for PCR amplification and sequencing. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2 , library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled.
   abstract: Comparing sequences of archived viruses collected over many years to the present allows the study of viral evolution and contributes to the design of new vaccines. However, the difficulty, time and expense of generating full-length sequences individually from each archived sample have hampered these studies. Next generation sequencing technologies have been utilized for analysis of clinical and environmental samples to identify viral pathogens that may be present. This has led to the discovery of many new, uncharacterized viruses from a number of viral families. Use of these sequencing technologies would be advantageous in examining viral evolution. In this study, a sequencing procedure was used to sequence simultaneously and rapidly multiple archived samples using a single standard protocol. This procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation each viral library following pooling and sequencing. This conferred sequence independence by random priming both first and second strand cDNA synthesis. Viral stocks were treated with a nuclease cocktail to reduce the presence of host nucleic acids. Viral RNA was extracted, followed by single tube random-primed double-stranded cDNA synthesis. The resultant cDNAs were amplified by primer-specific PCR, pooled, size fractionated and sequenced on the Ion Torrent PGM platform. The individual virus genomes were readily assembled by both de novo and template-assisted assembly methods. This procedure consistently resulted in near full length, if not full-length, genomic sequences and was used to sequence multiple bovine pestivirus and coronavirus isolates simultaneously.
        url: https://doi.org/10.1016/j.jviromet.2014.02.016
        doi: 10.1016/j.jviromet.2014.02.016

         id: cord-014461-2ubh9u8r
     author: Nelson, Oranmiyan W.
      title: Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
       date: 2012-10-10
      words: 4124.0
  sentences: 454.0
      pages: 
     flesch: 44.0
      cache: ./cache/cord-014461-2ubh9u8r.txt
        txt: ./txt/cord-014461-2ubh9u8r.txt
    summary: Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp. Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011 Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean Complete Genome Sequence of a Polyomavirus Isolated from Horses Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea Draft Genome Sequence of Aspergillus oryzae Strain 3.042
   abstract: The purpose of this table is to provide the community with a citable record of publications of ongoing genome sequencing projects that have led to a publication in the scientific literature. While our goal is to make the list complete, there is no guarantee that we may have omitted one or more publications appearing in this time frame. Readers and authors who wish to have publications added to subsequent versions of this list are invited to provide the bibliographic data for such references to the SIGS editorial office.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3570808/
        doi: 10.4056/sigs.3416907

         id: cord-016293-pyb00pt5
     author: Newell-McGloughlin, Martina
      title: The flowering of the age of Biotechnology 1990–2000
       date: 2006
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7120537/
        doi: 10.1007/1-4020-5149-2_4

         id: cord-255371-o9oxchq6
     author: Nguyen, Thanh Thi
      title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
       date: 2020-07-10
      words: 5640.0
  sentences: 365.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-255371-o9oxchq6.txt
        txt: ./txt/cord-255371-o9oxchq6.txt
    summary: title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus) This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited to the NCBI GenBank so far and create a spreadsheet dataset of all mutations occurred across different genes. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13] . By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics.
   abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to make changes in protein secondary structure and relative solvent accessibility. Based on 6,324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 in many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are most stable, potentially suitable to be targeted for vaccine and drug development.
        url: https://doi.org/10.1101/2020.07.10.171769
        doi: 10.1101/2020.07.10.171769

         id: cord-012975-u87ol3fs
     author: Ogiwara, Atsushi
      title: Construction of a dictionary of sequence motifs that characterize groups of related proteins
       date: 1992-09-17
      words: 3112.0
  sentences: 165.0
      pages: 
     flesch: 55.0
      cache: ./cache/cord-012975-u87ol3fs.txt
        txt: ./txt/cord-012975-u87ol3fs.txt
    summary: An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies. Functional groups of proteins Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites.
   abstract: An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database and a dictionary of sequence motifs that relate to specific superfamilies constructed. The motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences. The sequence motifs identified represent functionally important sites on protein molecules. When multiple blocks exist in a single motif they are often close together in the 3-D structure. Furthermore, occasionally these motif blocks were found to be split by introns when the correlation with exon structures was examined.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528547/
        doi: 10.1093/protein/5.6.479

         id: cord-355075-ieb35upi
     author: Papenfuss, Anthony T
      title: The immune gene repertoire of an important viral reservoir, the Australian black flying fox
       date: 2012-06-20
      words: 8952.0
  sentences: 480.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-355075-ieb35upi.txt
        txt: ./txt/cord-355075-ieb35upi.txt
    summary: alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. To enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total RNA obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. A full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to Mx1 based on comparison with known mammalian Mx1 and Mx2 family members (Figure 4a and data not shown). Genes involved in the adaptive immune system, including MHC class I and II genes and T and B cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response.
   abstract: BACKGROUND: Bats are the natural reservoir host for a range of emerging and re-emerging viruses, including SARS-like coronaviruses, Ebola viruses, henipaviruses and Rabies viruses. However, the mechanisms responsible for the control of viral replication in bats are not understood and there is little information available on any aspect of antiviral immunity in bats. Massively parallel sequencing of the bat transcriptome provides the opportunity for rapid gene discovery. Although the genomes of one megabat and one microbat have now been sequenced to low coverage, no transcriptomic datasets have been reported from any bat species. In this study, we describe the immune transcriptome of the Australian flying fox, Pteropus alecto, providing an important resource for identification of genes involved in a range of activities including antiviral immunity. RESULTS: Towards understanding the adaptations that have allowed bats to coexist with viruses, we have de novo assembled transcriptome sequence from immune tissues and stimulated cells from P. alecto. We identified about 18,600 genes involved in a broad range of activities with the most highly expressed genes involved in cell growth and maintenance, enzyme activity, cellular components and metabolism and energy pathways. 3.5% of the bat transcribed genes corresponded to immune genes and a total of about 500 immune genes were identified, providing an overview of both innate and adaptive immunity. A small proportion of transcripts found no match with annotated sequences in any of the public databases and may represent bat-specific transcripts. CONCLUSIONS: This study represents the first reported bat transcriptome dataset and provides a survey of expressed bat genes that complement existing bat genomic data. In addition, these data provide insight into genes relevant to the antiviral responses of bats, and form a basis for examining the roles of these molecules in immune response to viral infection.
        url: https://doi.org/10.1186/1471-2164-13-261
        doi: 10.1186/1471-2164-13-261

         id: cord-304607-td0776wj
     author: Paszkiewicz, Konrad H.
      title: Omics, Bioinformatics, and Infectious Disease Research
       date: 2010-12-24
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Bioinformatics is basically the study of informatic processes in biotic systems. Actually what constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. This chapter discusses the considerable progress in infectious diseases research that has been made in recent years using various “omics” case studies. Bioinformatics is tasked with making sense of it, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. This chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics, briefs metagenomics that finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms. This chapter explains the various possibilities of pan-genome, transcriptional reshaping and also enormous progress of proteomics study. Bioinformatic algorithms and tools are crucial tools in analyzing the data. The chapter also attempts to provide some details on the various problems and solution in bioinformatics that current-day scientists face while concentrating on second-generation sequencing strategies.
        url: https://api.elsevier.com/content/article/pii/B9780123848901000182
        doi: 10.1016/b978-0-12-384890-1.00018-2

         id: cord-264135-s2u76pvk
     author: Patel, Amrutlal K.
      title: Complete genome sequence analysis of chicken astrovirus isolate from India
       date: 2016-12-23
      words: 3755.0
  sentences: 217.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-264135-s2u76pvk.txt
        txt: ./txt/cord-264135-s2u76pvk.txt
    summary: Phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed CAstV/INDIA/ANAND/2016 nearest to the CAstV/4175 isolate (Fig. 2) . B-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate A total of 9-10 epitopes were predicted using SVMTriP using the capsid protein sequence of the astroviruses. Phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of the CAstV/ INDIA/ANAND/2016 nearest to that of CastV/4175 and CAstV/GA2011 and all four chicken astrovirus formed separate cluster except capsid protein of the CAstV/Poland/G059/ 2014 isolate which was clustered along with the duck astroviruses. The analysis of capsid protein sequence of reported chicken astroviruses from India revealed limited structural divergence suggesting their common ancestral origin and recent emergence. Fig. 4 Phylogenetic relatedness of chicken astrovirus isolate CAstV/India/Anand/2016 ORF2 coding sequences (a) and ORF2 encoded capsid protein (b) with reported Indian isolates based on neighbour-joining method with
   abstract: OBJECTIVE: Chicken astroviruses have been known to cause severe disease in chickens leading to increased mortality and “white chicks” condition. Here we aim to characterize the causative agent of visceral gout suspected for astrovirus infection in broiler breeder chickens. METHODS: Total RNA isolated from allantoic fluid of SPF embryo passaged with infected chicken sample was sequenced by whole genome shotgun sequencing using ion-torrent PGM platform. The sequence was analysed for the presence of coding and non-coding features, its similarity with reported isolates and epitope analysis of capsid structural protein. RESULTS: The consensus length of 7513 bp genome sequence of Indian isolate of chicken astrovirus was obtained after assembly of 14,121 high quality reads. The genome was comprised of 13 bp 5′-UTR, three open reading frames (ORFs) including ORF1a encoding serine protease, ORF1b encoding RNA dependent RNA polymerase (RdRp) and ORF2 encoding capsid protein, and 298 bp of 3′-UTR which harboured two corona virus stem loop II like “s2m” motifs and a poly A stretch of 19 nucleotides. The genetic analysis of CAstV/INDIA/ANAND/2016 suggested highest sequence similarity of 86.94% with the chicken astrovirus isolate CAstV/GA2011 followed by 84.76% with CAstV/4175 and 74.48%% with CAstV/Poland/G059/2014 isolates. The capsid structural protein of CAstV/INDIA/ANAND/2016 showed 84.67% similarity with chicken astrovirus isolate CAstV/GA2011, 81.06% with CAstV/4175 and 41.18% with CAstV/Poland/G059/2014 isolates. However, the capsid protein sequence showed high degree of sequence identity at nucleotide level (98.64-99.32%) and at amino acids level (97.74–98.69%) with reported sequences of Indian isolates suggesting their common origin and limited sequence divergence. The epitope analysis by SVMTriP identified two unique epitopes in our isolate, seven shared epitopes among Indian isolates and two shared epitopes among all isolates except Poland isolate which carried all distinct epitopes. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11259-016-9673-6) contains supplementary material, which is available to authorized users.
        url: https://www.ncbi.nlm.nih.gov/pubmed/28012117/
        doi: 10.1007/s11259-016-9673-6

         id: cord-341564-fvuwick5
     author: Qi, Zhao-Hui
      title: Novel Method of 3-Dimensional Graphical Representation for Proteins and Its Application
       date: 2018-06-12
      words: 2647.0
  sentences: 178.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-341564-fvuwick5.txt
        txt: ./txt/cord-341564-fvuwick5.txt
    summary: From these, we can see that physicochemical properties are widely applied with graphical representation of protein sequences by these researchers and their results seem well. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. In this article, we propose a 3-dimensional (3D) graphic representation of protein sequences based on 10 physicochemical properties [17] [18] [19] [20] [21] of amino acids and the BLOSUM62 matrix. Therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the BLOSUM62 matrix. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation F-Curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
   abstract: In this article, we propose a 3-dimensional graphical representation of protein sequences based on 10 physicochemical properties of 20 amino acids and the BLOSUM62 matrix. It contains evolutionary information and provides intuitive visualization. To further analyze the similarity of proteins, we extract a specific vector from the graphical representation curve. The vector is used to calculate the similarity distance between 2 protein sequences. To prove the effectiveness of our approach, we apply it to 3 real data sets. The results are consistent with the known evolution fact and show that our method is effective in phylogenetic analysis.
        url: https://www.ncbi.nlm.nih.gov/pubmed/29977111/
        doi: 10.1177/1176934318777755

         id: cord-321715-bkfkmtld
     author: Redelings, Benjamin D
      title: Incorporating indel information into phylogeny estimation for rapidly emerging pathogens
       date: 2007-03-14
      words: 9793.0
  sentences: 546.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-321715-bkfkmtld.txt
        txt: ./txt/cord-321715-bkfkmtld.txt
    summary: To see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. These parameters include a multiple alignment A that specifies the positional homology between the sequences Y, an evolutionary tree (τ, T) where τ is an unrooted bifurcating tree topology and T = (t 1 , ..., t 2N -3 ) is a vector of branch lengths along the edges in τ, and vectors Θ and Λ are parameters that characterize the letter substitution and indel processes respectively. We therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. Since the joint model balances substitution and indel information as well as taking alignment ambiguity into account we assume that these differences represent an improvement in the accuracy of estimation.
   abstract: BACKGROUND: Phylogenies of rapidly evolving pathogens can be difficult to resolve because of the small number of substitutions that accumulate in the short times since divergence. To improve resolution of such phylogenies we propose using insertion and deletion (indel) information in addition to substitution information. We accomplish this through joint estimation of alignment and phylogeny in a Bayesian framework, drawing inference using Markov chain Monte Carlo. Joint estimation of alignment and phylogeny sidesteps biases that stem from conditioning on a single alignment by taking into account the ensemble of near-optimal alignments. RESULTS: We introduce a novel Markov chain transition kernel that improves computational efficiency by proposing non-local topology rearrangements and by block sampling alignment and topology parameters. In addition, we extend our previous indel model to increase biological realism by placing indels preferentially on longer branches. We demonstrate the ability of indel information to increase phylogenetic resolution in examples drawn from within-host viral sequence samples. We also demonstrate the importance of taking alignment uncertainty into account when using such information. Finally, we show that codon-based substitution models can significantly affect alignment quality and phylogenetic inference by unrealistically forcing indels to begin and end between codons. CONCLUSION: These results indicate that indel information can improve phylogenetic resolution of recently diverged pathogens and that alignment uncertainty should be considered in such analyses.
        url: https://www.ncbi.nlm.nih.gov/pubmed/17359539/
        doi: 10.1186/1471-2148-7-40

         id: cord-267500-x3u9i1vq
     author: Rose, Rebecca
      title: Challenges in the analysis of viral metagenomes
       date: 2016-08-03
      words: 5928.0
  sentences: 308.0
      pages: 
     flesch: 40.0
      cache: ./cache/cord-267500-x3u9i1vq.txt
        txt: ./txt/cord-267500-x3u9i1vq.txt
    summary: Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of lowfrequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al.
   abstract: Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.
        url: https://www.ncbi.nlm.nih.gov/pubmed/29492275/
        doi: 10.1093/ve/vew022

         id: cord-300149-djclli8n
     author: Ruan, Yijun
      title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
       date: 2003-05-24
      words: 4355.0
  sentences: 226.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-300149-djclli8n.txt
        txt: ./txt/cord-300149-djclli8n.txt
    summary: title: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. All genetic variations of Singapore isolates identified when compared with available SARS-CoV genome sequences were further confirmed by primer extension genotyping technology (Sequenom, San Diego, CA, USA). These sequences showed that the genomes of SARS-CoV isolated in Singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain SIN2748 and a six-nucleotide deletion in SIN2677.
   abstract: BACKGROUND: The cause of severe acute respiratory syndrome (SARS) has been identified as a new coronavirus. Whole genome sequence analysis of various isolates might provide an indication of potential strain differences of this new virus. Moreover, mutation analysis will help to develop effective vaccines. METHODS: We sequenced the entire SARS viral genome of cultured isolates from the index case (SIN2500) presenting in Singapore, from three primary contacts (SIN2774, SIN2748, and SIN2677), and one secondary contact (SIN2679). These sequences were compared with the isolates from Canada (TOR2), Hong Kong (CUHK-W1 and HKU39849), Hanoi (URBANI), Guangzhou (GZ01), and Beijing (BJ01, BJ02, BJ03, BJ04). FINDINGS: We identified 129 sequence variations among the 14 isolates, with 16 recurrent variant sequences. Common variant sequences at four loci define two distinct genotypes of the SARS virus. One genotype was linked with infections originating in Hotel M in Hong Kong, the second contained isolates from Hong Kong, Guangzhou, and Beijing with no association with Hotel M (p<0.0001). Moreover, other common sequence variants further distinguished the geographical origins of the isolates, especially between Singapore and Beijing. INTERPRETATION: Despite the recent onset of the SARS epidemic, genetic signatures are emerging that partition the worldwide SARS viral isolates into groups on the basis of contact source history and geography. These signatures can be used to trace sources of infection. In addition, a common variant associated with a non-conservative aminoacid change in the S1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the SARS virus in human populations. Published online May 9, 2003 http://image.thelancet.com/extras/03art4454web.pdf
        url: https://www.ncbi.nlm.nih.gov/pubmed/12781537/
        doi: 10.1016/s0140-6736(03)13414-9

         id: cord-015850-ef6svn8f
     author: Saitou, Naruya
      title: Eukaryote Genomes
       date: 2013-08-22
      words: 7424.0
  sentences: 484.0
      pages: 
     flesch: 53.0
      cache: ./cache/cord-015850-ef6svn8f.txt
        txt: ./txt/cord-015850-ef6svn8f.txt
    summary: General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specifi c to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffl e ( Tuber melanosporum ), shown as A i n Fig. 8.9 , has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [ 81 ] .
   abstract: General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7119937/
        doi: 10.1007/978-1-4471-5304-7_8

         id: cord-264296-0x90yubt
     author: Sawmya, Shashata
      title: Analyzing hCov genome sequences: Applying Machine Intelligence and beyond
       date: 2020-06-03
      words: 5008.0
  sentences: 312.0
      pages: 
     flesch: 60.0
      cache: ./cache/cord-264296-0x90yubt.txt
        txt: ./txt/cord-264296-0x90yubt.txt
    summary: We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. C. Several CNN-RNN based models are used to predict mutations at specific Sites of Interest (SoIs) of the sars-cov-2 genome sequence followed by further analyses of the same on several South-Asian countries. D. Overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome.
   abstract: Covid-19 pandemic, caused by the sars-cov-2 strain of coronavirus, has affected millions of people all over the world and taken thousands of lives. It is of utmost importance that the character of this deadly virus be studied and its nature be analysed. We present here an analysis pipeline comprising phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and extraction of important features from its genetic material that are used subsequently to predict mutation at those interesting sites using deep learning techniques. In a nutshell, we have prepared an analysis pipeline for hCov genome sequences leveraging the power of machine intelligence and uncovered what remained apparently shrouded by raw data.
        url: https://doi.org/10.1101/2020.06.03.131987
        doi: 10.1101/2020.06.03.131987

         id: cord-268467-btfz6ye8
     author: Schreiber, Steven S.
      title: Sequence analysis of the nucleocapsid protein gene of human coronavirus 229E
       date: 1989-03-31
      words: 5035.0
  sentences: 343.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-268467-btfz6ye8.txt
        txt: ./txt/cord-268467-btfz6ye8.txt
    summary: The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA. This result suggested that the HCV229E subgenomic mRNAs possess a nested-set structure similar to other coronaviruses and that A34 represented a cDNA clone of either the 3''-end of the genomic RNA or the leader sequence. The 3''-noncoding region contains the sequence TGGAAGAGCCA, 75 nucleotides from the 3''-end (Fig. 4) which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (Kapke and Brian, 1986; Skinner and Siddell, 1984; Armstrong et a/., 1983; Lapps et al., 1987; Kamahora et a/., 1988; Boursnell et al., 1985) ( Table 1) . Three intergenic regions of coronavirus mouse hepatitis virus strain A59 genome RNA contain a common nucleotide sequence that is homologous to the 3''end of the viral mRNA leader sequence
   abstract: Abstract Human coronaviruses are important human pathogens and have also been implicated in multiple sclerosis. To further understand the molecular biology of human coronavirus 229E (HCV-229E), molecular cloning and sequence analysis of the viral RNA have been initiated. Following established protocols, the 3′-terminal 1732 nucleotides of the genome were sequenced. A large open reading frame encodes a 389 amino acid protein of 43,366 Da, which is presumably the nucleocapsid protein. The predicted protein is similar in size, chemical properties, and amino acid sequence to the nucleocapsid proteins of other coronaviruses. This is especially evident when the sequence is compared with that of the antigenically related porcine transmissible gastroenteritis virus (TGEV), with which a region of 46% amino acid sequence homology was found. Hydropathy profiles revealed the existence of several conserved domains which could have functional significance. An intergenic consensus sequence precedes the 5′-end of the proposed nucleocapsid protein gene. The consensus sequence is present in other coronaviruses and has been proposed as the site of binding of the leader sequence for mRNA transcriptional start. This region was also examined by primer extension analysis of mRNAs, which identified a 60-nucleotide leader sequence. The 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the Coronavirus family and lends support to the theory that this region is important for the replication of negative-strand RNA.
        url: https://api.elsevier.com/content/article/pii/0042682289900500
        doi: 10.1016/0042-6822(89)90050-0

         id: cord-010273-0c56x9f5
     author: Simmonds, Peter
      title: Virology of hepatitis C virus
       date: 2001-10-10
      words: 7897.0
  sentences: 337.0
      pages: 
     flesch: 41.0
      cache: ./cache/cord-010273-0c56x9f5.txt
        txt: ./txt/cord-010273-0c56x9f5.txt
    summary: 1,2 The identification of HCV led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned HCV sequences or direct detection of virus ribonucleic acid (RNA) sequences by polymerase chain reaction (PCR) using primers complimentary to the HCV genome. 6 ''13 Remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have RNA-dependent RNA polymerase amino acid sequences that are perhaps more similar to those of HCV than are the flaviviruses. In contrast to the highly restricted sequence diversity of the 5''NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients, ll5 Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection.
   abstract: Hepatitis C virus (HCV) has been identified as the main causative agent of post-transfusion non-A, non-B hepatitis. Through recently developed diagnostic assays, routine serologic screening of blood donors has prevented most cases of post-transfusion hepatitis. The purpose of this paper is to comprehensively review current information regarding the virology of HCV. Recent findings on the genome organization, its relationship to other viruses, the replication of HCV ribonucleic acid, HCV translation, and HCV polyprotein expression and processing are discussed. Also reviewed are virus assembly and release, the variability of HCV and its classification into genotypes, the geographic distribution of HCV genotypes, and the biologic differences between HCV genotypes. The assays used in HCV genotyping are discussed in terms of reliability and consistency of results, and the molecular epidemiology of HCV infection is reviewed. These approaches to HCV epidemiology will prove valuable in documenting the spread of HCV in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and in understanding more about the origins and evolution of HCV.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173289/
        doi: 10.1016/s0149-2918(96)80193-7

         id: cord-213136-euv6pqh5
     author: Singh, Kulveer
      title: Sequence Effects on Internal Structure of Droplets of Associative Polymers
       date: 2020-05-17
      words: 4329.0
  sentences: 184.0
      pages: 
     flesch: 56.0
      cache: ./cache/cord-213136-euv6pqh5.txt
        txt: ./txt/cord-213136-euv6pqh5.txt
    summary: We study the evolution of internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. Since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. For three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance R ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers).
   abstract: We used Langevin dynamics simulations of short associative polymers with two stickers placed symmetrically along their contour to study the effect of the primary sequence of these polymers on their organization inside condensed droplets. We observed that the shape, size and number of sticker clusters inside the condensed droplet change from a single cylindrical fiber to many compact clusters, as one varies the location of stickers along the chain contour. Aging due to conversion of intramoleclular to intermolecular associations was observed in droplets of telechelic polymers, but not for other sequences of associating polymers. The relevance of our results to condensates of intrinsically disordered proteins is discussed.
        url: https://arxiv.org/pdf/2005.08246v1.pdf
        doi: nan

         id: cord-022348-w7z97wir
     author: Sola, Monica
      title: Drift and Conservatism in RNA Virus Evolution: Are They Adapting or Merely Changing?
       date: 2007-09-02
      words: 10892.0
  sentences: 671.0
      pages: 
     flesch: 56.0
      cache: ./cache/cord-022348-w7z97wir.txt
        txt: ./txt/cord-022348-w7z97wir.txt
    summary: An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. Under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. For a given virus, different protein sequence sets were compared to a given reference such as RT in the case of HIV/SIV. Although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 Gag/p24 Gag or gp120/gp41, yielded relative values that differed from those given in Table 6 .1 by at most 14%. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships (Table 6 .1). In the clear cases where genetic variation is exploited by RNA viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity.
   abstract: This chapter argues that the vast majority of genetic changes or mutations fixed by RNA viruses are essentially neutral or nearly neutral in character. In molecular evolution one of the remarkable observations has been the uniformity of the molecular clock. An analysis of proteins derived from complete potyvirus genomes, positive-stranded RNA viruses, yielded highly significant linear relationships. These analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. Synonymous changes are invariably more frequent than nonsynonymous changes. Positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. Although evolution has connotations of change, what has always counted is natural selection or adaptation. It is the only force for the genesis of a novel replicon.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7155598/
        doi: 10.1016/b978-012220360-2/50007-6

         id: cord-266960-kyx6xhvj
     author: Temple, Mark D.
      title: Real-time audio and visual display of the Coronavirus genome
       date: 2020-10-02
      words: 6780.0
  sentences: 360.0
      pages: 
     flesch: 56.0
      cache: ./cache/cord-266960-kyx6xhvj.txt
        txt: ./txt/cord-266960-kyx6xhvj.txt
    summary: The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Audio generated from each of these sequence motifs and metadata were combined to create a complex auditory display to represent either transcription or translation. High resolution analysis of gene expression in Coronavirus genomes has detected ribosome protected fragments which map to non-canonical ORF''s, these may be novel protein-coding ORFs and short regulatory uORFs. The tool highlights the occurrence of one such uORF of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of TRS1 [35] that is not documented in the GenBank metadata. In the Additional file 4: supplementary example ''Sonification Sub-genomic RNA'' the auditory display represents the process of transcription.
   abstract: BACKGROUND: This paper describes a web based tool that uses a combination of sonification and an animated display to inquire into the SARS-CoV-2 genome. The audio data is generated in real time from a variety of RNA motifs that are known to be important in the functioning of RNA. Additionally, metadata relating to RNA translation and transcription has been used to shape the auditory and visual displays. Together these tools provide a unique approach to further understand the metabolism of the viral RNA genome. This audio provides a further means to represent the function of the RNA in addition to traditional written and visual approaches. RESULTS: Sonification of the SARS-CoV-2 genomic RNA sequence results in a complex auditory stream composed of up to 12 individual audio tracks. Each auditory motive is derived from the actual RNA sequence or from metadata. This approach has been used to represent transcription or translation of the viral RNA genome. The display highlights the real-time interaction of functional RNA elements. The sonification of codons derived from all three reading frames of the viral RNA sequence in combination with sonified metadata provide the framework for this display. Functional RNA motifs such as transcription regulatory sequences and stem loop regions have also been sonified. Using the tool, audio can be generated in real-time from either genomic or sub-genomic representations of the RNA. Given the large size of the viral genome, a collection of interactive buttons has been provided to navigate to regions of interest, such as cleavage regions in the polyprotein, untranslated regions or each gene. These tools are available through an internet browser and the user can interact with the data display in real time. CONCLUSION: The auditory display in combination with real-time animation of the process of translation and transcription provide a unique insight into the large body of evidence describing the metabolism of the RNA genome. Furthermore, the tool has been used as an algorithmic based audio generator. These audio tracks can be listened to by the general community without reference to the visual display to encourage further inquiry into the science.
        url: https://doi.org/10.1186/s12859-020-03760-7
        doi: 10.1186/s12859-020-03760-7

         id: cord-300807-9u8idlon
     author: Tong, Joo Chuan
      title: 7 Infectious disease informatics
       date: 2013-12-31
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Abstract: Throughout history, infectious diseases have posed a serious burden to mankind. More recently, there has been an alarming increase in drug-resistant microbes. Furthermore, new pathogens are emerging due to microbial evolution and adaptation. The spread of these diseases is a result of pathogen mutations and changes in human behavior patterns. Then, there are diseases that are lurking in the background, waiting for the right conditions before they strike again. In the war against these diseases, we have come to understand the behaviors of microbes in a heterogeneous world and the mechanisms governing disease transmission. These works have profoundly shaped modern knowledge of emerging and re-emerging infections. More recently, computational techniques have led the way into this new era by allowing rapid high-throughput analysis of pathogens which was previously not possible using traditional laboratory techniques. This chapter introduces methods in mathematical modeling, computational biology, and bioinformatics that have been used to study infectious diseases.
        url: https://api.elsevier.com/content/article/pii/B9781907568411500076
        doi: 10.1533/9781908818416.99

         id: cord-254942-g51mjj2b
     author: Touati, Rabeb
      title: New methodology for repetitive sequences identification in human X and Y chromosomes
       date: 2020-10-19
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Repetitive DNA sequences occupy the major proportion of DNA in the human genome and even in the other species’ genomes. The importance of each repetitive DNA type depends on many factors: structural and functional roles, positions, lengths and numbers of these repetitions are clear examples. Conserving such DNA sequences or not in different locations in the chromosome remains a challenge for researchers in biology. Detecting their location despite their great variability and finding novel repetitive sequences remains a challenging task. To side-step this problem, we developed a new method based on signal and image processing tools. In fact, using this method we could find repetitive patterns in DNA images regardless of the repetition length. This new technique seems to be more efficient in detecting new repetitive sequences than bioinformatics tools. In fact, the classical tools present limited performances especially in case of mutations (insertion or deletion). However, modifying one or a few numbers of pixels in the image doesn’t affect the global form of the repetitive pattern. As a consequence, we generated a new repetitive patterns database which contains tandem and dispersed repeated sequences. The highly repetitive sequences, we have identified in X and Y chromosomes, are shown to be located in other human chromosomes or in other genomes. The data we have generated is then taken as input to a Convolutional neural network classifier in order to classify them. The system we have constructed is efficient and gives an average of 94.4% as recognition score.
        url: https://www.ncbi.nlm.nih.gov/pubmed/33101452/
        doi: 10.1016/j.bspc.2020.102207

         id: cord-301827-a7hnuxy5
     author: Uversky, Vladimir N
      title: A decade and a half of protein intrinsic disorder: Biology still waits for physics
       date: 2013-04-29
      words: 20971.0
  sentences: 1059.0
      pages: 
     flesch: 43.0
      cache: ./cache/cord-301827-a7hnuxy5.txt
        txt: ./txt/cord-301827-a7hnuxy5.txt
    summary: 94 Therefore, the abundance and peculiarities of the charged residues distribution within the protein sequences might determine physical and biological properties of extended IDPs and IDPRs. Also, simple polymer physics-based reasoning can give reasonably well-justified explanation of the conformational behavior of extended IDPs. In general, the conformational behavior of IDPs is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in pH, and the ability to gain structure in the presence of various binding partners. 183 This analysis revealed that proteins involved in regulation and execution of PCD possess substantial amount of intrinsic disorder and IDPRs were implemented in a number of crucial functions, such as protein-protein interactions, interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns.
   abstract: The abundant existence of proteins and regions that possess specific functions without being uniquely folded into unique 3D structures has become accepted by a significant number of protein scientists. Sequences of these intrinsically disordered proteins (IDPs) and IDP regions (IDPRs) are characterized by a number of specific features, such as low overall hydrophobicity and high net charge which makes these proteins predictable. IDPs/IDPRs possess large hydrodynamic volumes, low contents of ordered secondary structure, and are characterized by high structural heterogeneity. They are very flexible, but some may undergo disorder to order transitions in the presence of natural ligands. The degree of these structural rearrangements varies over a very wide range. IDPs/IDPRs are tightly controlled under the normal conditions and have numerous specific functions that complement functions of ordered proteins and domains. When lacking proper control, they have multiple roles in pathogenesis of various human diseases. Gaining structural and functional information about these proteins is a challenge, since they do not typically “freeze” while their “pictures are taken.” However, despite or perhaps because of the experimental challenges, these fuzzy objects with fuzzy structures and fuzzy functions are among the most interesting targets for modern protein research. This review briefly summarizes some of the recent advances in this exciting field and considers some of the basic lessons learned from the analysis of physics, chemistry, and biology of IDPs.
        url: https://doi.org/10.1002/pro.2261
        doi: 10.1002/pro.2261

         id: cord-339209-oe8onyr9
     author: Vasilakis, Nikos
      title: Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range
       date: 2014-05-20
      words: 5817.0
  sentences: 272.0
      pages: 
     flesch: 46.0
      cache: ./cache/cord-339209-oe8onyr9.txt
        txt: ./txt/cord-339209-oe8onyr9.txt
    summary: The organization of each genome was similar to that described previously for the mesoniviruses (NDiV, CavV, HanaV, NseV and MenoV), featuring a long 5''-untranslated region (5''-UTR) of 359 to 370 nt, six major long open reading frames (ORFs), and a long terminal region of 1780 to 1804 nt preceding the poly[A] tail ( Figure 2 ). To determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ML) phylogenetic trees were constructed based on the amino acid alignments of ORF2a (unprocessed S protein) and a concatenated region of the highly conserved domains within ORF1ab (3CL pro , RdRp and ZnHel1). A Clustal X alignment of the mesonivirus ORF3a proteins and individual structural analyses using SignalP and TMHMM and NetNGlyc (www.expasy.org) indicated that each is a class I transmembrane glycoprotein with a predicted N-termimal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved N-glycosylation site, a transmembrane domain and a C-terminal cytoplasmic domain ( Figure 4A, 4D) .
   abstract: BACKGROUND: The family Mesoniviridae (order Nidovirales) comprises of a group of positive-sense, single-stranded RNA ([+]ssRNA) viruses isolated from mosquitoes. FINDINGS: Thirteen novel insect-specific virus isolates were obtained from mosquitoes collected in Indonesia, Thailand and the USA. By electron microscopy, the virions appeared as spherical particles with a diameter of ~50 nm. Their 20,129 nt to 20,777 nt genomes consist of positive-sense, single-stranded RNA with a poly-A tail. Four isolates from Houston, Texas, and one isolate from Java, Indonesia, were identified as variants of the species Alphamesonivirus-1 which also includes Nam Dinh virus (NDiV) from Vietnam and Cavally virus (CavV) from Côte d’Ivoire. The eight other isolates were identified as variants of three new mesoniviruses, based on genome organization and pairwise evolutionary distances: Karang Sari virus (KSaV) from Java, Bontag Baru virus (BBaV) from Java and Kalimantan, and Kamphaeng Phet virus (KPhV) from Thailand. In comparison with NDiV, the three new mesoniviruses each contained a long insertion (180 – 588 nt) of unknown function in the 5’ region of ORF1a, which accounted for much of the difference in genome size. The insertions contained various short imperfect repeats and may have arisen by recombination or sequence duplication. CONCLUSIONS: In summary, based on their genome organizations and phylogenetic relationships, thirteen new viruses were identified as members of the family Mesoniviridae, order Nidovirales. Species demarcation criteria employed previously for mesoniviruses would place five of these isolates in the same species as NDiV and CavV (Alphamesonivirus-1) and the other eight isolates would represent three new mesonivirus species (Alphamesonivirus-5, Alphamesonivirus-6 and Alphamesonivirus-7). The observed spatiotemporal distribution over widespread geographic regions and broad species host range in mosquitoes suggests that mesoniviruses may be common in mosquito populations worldwide.
        url: https://doi.org/10.1186/1743-422x-11-97
        doi: 10.1186/1743-422x-11-97

         id: cord-296691-cg463fbn
     author: Wang, Ren
      title: De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing
       date: 2013-04-09
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: BACKGROUND: Lycoris aurea, also called Golden Magic Lily, is an ornamentally and medicinally important species of the Amaryllidaceae family. To date, the sequencing of its whole genome is unavailable as a non-model organism. Transcriptomic information is also scarce for this species. In this study, we performed de novo transcriptome sequencing to produce the first comprehensive expressed sequence tag (EST) dataset for L. aurea using high-throughput sequencing technology. METHODOLOGY AND PRINCIPAL FINDINGS: Total RNA was isolated from leaves with sodium nitroprusside (SNP), salicylic acid (SA), or methyl jasmonate (MeJA) treatment, stems, and flowers at the bud, blooming, and wilting stages. Equal quantities of RNA from each tissue and stage were pooled to construct a cDNA library. Using 454 pyrosequencing technology, a total of 937,990 high quality reads (308.63 Mb) with an average read length of 329 bp were generated. Clustering and assembly of these reads produced a non-redundant set of 141,111 unique sequences, comprising 24,604 contigs and 116,507 singletons. All of the unique sequences were involved in the biological process, cellular component and molecular function categories by GO analysis. Potential genes and their functions were predicted by KEGG pathway mapping and COG analysis. Based on our sequence analysis and published literatures, many putative genes involved in Amaryllidaceae alkaloids synthesis, including PAL, TYDC OMT, NMT, P450, and other potentially important candidate genes, were identified for the first time in this Lycoris. Furthermore, 6,386 SSRs and 18,107 high-confidence SNPs were identified in this EST dataset. CONCLUSIONS: The transcriptome provides an invaluable new data for a functional genomics resource and future biological research in L. aurea. The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will provide useful information for functional genomic research in future.
        url: https://www.ncbi.nlm.nih.gov/pubmed/23593220/
        doi: 10.1371/journal.pone.0060449

         id: cord-324216-ce3wa889
     author: Wang, Zheng
      title: Resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses
       date: 2008-12-01
      words: 5206.0
  sentences: 240.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-324216-ce3wa889.txt
        txt: ./txt/cord-324216-ce3wa889.txt
    summary: Due to the great genetic diversity of HRV and HEV, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of HRV and HEV, a predictive model was used to assist the microarray design [17] . This study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17] , developed by our group, that minimized the probes needed for detection and identification of most serotypes of HRV and HEV. A powerful feature of the expanded RPM-Flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample RNA/DNA and array-bound probe sets in conjunction with previously developed sequence analysis algorithm CIBSI can be easily interpreted to make serotype or strain identifications.
   abstract: BACKGROUND: Febrile respiratory illness (FRI) has a high impact on public health and global economics and poses a difficult challenge for differential diagnosis. A particular issue is the detection of genetically diverse pathogens, i.e. human rhinoviruses (HRV) and enteroviruses (HEV) which are frequent causes of FRI. Resequencing Pathogen Microarray technology has demonstrated potential for differential diagnosis of several respiratory pathogens simultaneously, but a high confidence design method to select probes for genetically diverse viruses is lacking. RESULTS: Using HRV and HEV as test cases, we assess a general design strategy for detecting and serotyping genetically diverse viruses. A minimal number of probe sequences (26 for HRV and 13 for HEV), which were potentially capable of detecting all serotypes of HRV and HEV, were determined and implemented on the Resequencing Pathogen Microarray RPM-Flu v.30/31 (Tessarae RPM-Flu). The specificities of designed probes were validated using 34 HRV and 28 HEV strains. All strains were successfully detected and identified at least to species level. 33 HRV strains and 16 HEV strains could be further differentiated to serotype level. CONCLUSION: This study provides a fundamental evaluation of simultaneous detection and differential identification of genetically diverse RNA viruses with a minimal number of prototype sequences. The results demonstrated that the newly designed RPM-Flu v.30/31 can provide comprehensive and specific analysis of HRV and HEV samples which implicates that this design strategy will be applicable for other genetically diverse viruses.
        url: https://www.ncbi.nlm.nih.gov/pubmed/19046445/
        doi: 10.1186/1471-2164-9-577

         id: cord-022494-d66rz6dc
     author: Webb, B.
      title: Comparative Modeling of Drug Target Proteins
       date: 2014-10-01
      words: 8782.0
  sentences: 453.0
      pages: 
     flesch: 47.0
      cache: ./cache/cord-022494-d66rz6dc.txt
        txt: ./txt/cord-022494-d66rz6dc.txt
    summary: Comparative modeling consists of four main steps 23 (Figure 2 (a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. Modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homologyderived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition ( Figure 2(b) ).
   abstract: In this perspective, we begin by describing the comparative protein structure modeling technique and the accuracy of the corresponding models. We then discuss the significant role that comparative prediction plays in drug discovery. We focus on virtual ligand screening against comparative models and illustrate the state-of-the-art by a number of specific examples.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157477/
        doi: 10.1016/b978-0-12-409547-2.11133-3

         id: cord-311839-61djk4bs
     author: Wei, Dan
      title: A novel hierarchical clustering algorithm for gene sequences
       date: 2012-07-23
      words: 8033.0
  sentences: 496.0
      pages: 
     flesch: 61.0
      cache: ./cache/cord-311839-61djk4bs.txt
        txt: ./txt/cord-311839-61djk4bs.txt
    summary: We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. DMk shows better performance than the k-tuple distance in our experiments, and mBKM outperforms SL, CL, AL, BKM and KM when tested on public gene sequence datasets. In this paper we propose a new alignment-free similarity measure, DMk, based on which we developed mBKM to cluster gene sequences. To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the k-tuple distance. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the k-tuple distance and DMk on real data sets listed in Table 1 .
   abstract: BACKGROUND: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of k-tuples in DNA sequence. Afterwards, a hierarchical procedure is applied to clustering DNA sequences based on the feature vectors. RESULTS: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. CONCLUSIONS: We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.
        url: https://doi.org/10.1186/1471-2105-13-174
        doi: 10.1186/1471-2105-13-174

         id: cord-343863-q1y8uscj
     author: Whitney, Joe
      title: Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
       date: 2005-02-08
      words: 3463.0
  sentences: 179.0
      pages: 
     flesch: 61.0
      cache: ./cache/cord-343863-q1y8uscj.txt
        txt: ./txt/cord-343863-q1y8uscj.txt
    summary: ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. ReHAB consists of four main components ( Figure 1 ): (1) a MySQL relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a Java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as NCBI BLAST and EMBOSS [12] utilities; (3) a Java Swing graphical client, downloaded and launched on client machines using Java Web Start; (4) and a back-end Java program which runs alignment programs and compiles results in the database.
   abstract: BACKGROUND: Sequence similarity searching is a powerful tool to help develop hypotheses in the quest to assign functional, structural and evolutionary information to DNA and protein sequences. As sequence databases continue to grow exponentially, it becomes increasingly important to repeat searches at frequent intervals, and similarity searches retrieve larger and larger sets of results. New and potentially significant results may be buried in a long list of previously obtained sequence hits from past searches. RESULTS: ReHAB (Recent Hits Acquired from BLAST) is a tool for finding new protein hits in repeated PSI-BLAST searches. ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. Results are presented in an easily comprehended table, or in a BLAST-like report, using colors to highlight the new hits. ReHAB is designed to handle large numbers of query sequences, such as whole genomes or sets of genomes. Advanced computer skills are not needed to use ReHAB; the graphics interface is simple to use and was designed with the bench biologist in mind. CONCLUSIONS: This software greatly simplifies the problem of evaluating the output of large numbers of protein database searches.
        url: https://www.ncbi.nlm.nih.gov/pubmed/15701178/
        doi: 10.1186/1471-2105-6-23

         id: cord-103029-nc5yf6x4
     author: Wichmann, Stefan
      title: Computational design of genes encoding completely overlapping protein domains: Influence of genetic code and taxonomic rank
       date: 2020-09-25
      words: 8665.0
  sentences: 387.0
      pages: 
     flesch: 52.0
      cache: ./cache/cord-103029-nc5yf6x4.txt
        txt: ./txt/cord-103029-nc5yf6x4.txt
    summary: In this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, Hidden Markov Model profile and secondary structure in order to determine the impact of OLG construction and which sequences are potentially functional. While the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for OLG construction is determined instead, which is more relevant in relation to both understanding constraints on the formation rate of naturally occuring OLGs and in assessing the likelihood of successful synthetic creation of OLGs. These results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here.
   abstract: Overlapping genes (OLGs) with long protein-coding overlapping sequences are often excluded by genome annotation programs, with the exception of virus genomes. A recent study used a novel algorithm to construct OLGs from arbitrary protein domain pairs and concluded that virus genes are best suited for creating OLGs, a result which fitted with common assumptions. However, improving sequence evaluation using Hidden Markov Models shows that the previous result is an artifact originating from dataset-database biases. When parameters for OLG design and evaluation are optimized we find that 94.5% of the constructed OLG pairs score at least as highly as naturally occurring sequences, while 9.6% of the artificial OLGs cannot be distinguished from typical sequences in their protein family. Constructed OLG sequences are also indistinguishable from natural sequences in terms of amino acid identity and secondary structure, while the minimum nucleotide change required for overprinting an overlapping sequence can be as low as 1.8% of the sequence. Separate analysis of datasets containing only sequences from either archaea, bacteria, eukaryotes or viruses showed that, surprisingly, virus genes are much less suitable for designing OLGs than bacterial or eukaryotic genes. An important factor influencing OLG design is the structure of the standard genetic code. Success rates in different reading frames strongly correlate with their code-determined respective amino acid constraints. There is a tendency indicating that the structure of the standard genetic code could be optimized in its ability to create OLGs while conserving mutational robustness. The findings reported here add to the growing evidence that OLGs should no longer be excluded in prokaryotic genome annotations. Determining the factors facilitating the computational design of artificial overlapping genes may improve our understanding of the origin of these remarkable genetic constructs and may also open up exciting possibilities for synthetic biology.
        url: https://doi.org/10.1101/2020.09.25.312959
        doi: 10.1101/2020.09.25.312959

         id: cord-103297-4stnx8dw
     author: Widrich, Michael
      title: Modern Hopfield Networks and Attention for Immune Repertoire Classification
       date: 2020-08-17
      words: 14093.0
  sentences: 926.0
      pages: 
     flesch: 57.0
      cache: ./cache/cord-103297-4stnx8dw.txt
        txt: ./txt/cord-103297-4stnx8dw.txt
    summary: In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: We demonstrate that continuous generalizations of binary modern Hopfield-networks (Krotov & Hopfield, 2016 Demircigil et al., 2017) have an update rule that is known as the attention mechanisms in the transformer. We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (Section "Experimental Results") Exponential storage capacity of continuous state modern Hopfield networks with transformer attention as update rule
   abstract: A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hop-field networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. Immune repertoire classification based on the vast number of immunosequences of an individual is a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC
        url: https://doi.org/10.1101/2020.04.12.038158
        doi: 10.1101/2020.04.12.038158

         id: cord-253436-dz84icdc
     author: Wille, Michelle
      title: High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl
       date: 2016-03-03
      words: 2019.0
  sentences: 103.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-253436-dz84icdc.txt
        txt: ./txt/cord-253436-dz84icdc.txt
    summary: In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the sample species and location in 2011, suggesting long-term maintenance in this population. Despite few studies, small samples sizes and differences in prevalence, what is clear, is that in the Northern Hemisphere waterfowl species, especially dabbling and diving ducks are important in the epidemiology of avian CoVs. It is interesting to note that these patterns are very similar to those found in low pathogenic influenza A viruses: high prevalence in waterfowl and gulls in the Northern Hemisphere [30] , and little host species and temporal structuring within waterfowl derived viruses in the conserved polymerase genes (such as PB2, PB1) [31] .
   abstract: Coronaviruses (CoVs) are found in a wide variety of wild and domestic animals, and constitute a risk for zoonotic and emerging infectious disease. In poultry, the genetic diversity, evolution, distribution and taxonomy of some coronaviruses have been well described, but little is known about the features of CoVs in wild birds. In this study we screened 764 samples from 22 avian species of the orders Anseriformes and Charadriiformes in Sweden collected in 2006/2007 for CoV, with an overall CoV prevalence of 18.7%, which is higher than many other wild bird surveys. The highest prevalence was found in the diving ducks—mainly Greater Scaup (Aythya marila; 51.5%)—and the dabbling duck Mallard (Anas platyrhynchos; 19.2%). Sequences from two of the Greater Scaup CoV fell into an infrequently detected lineage, shared only with a Tufted Duck (Aythya fuligula) CoV. Coronavirus sequences from Mallards in this study were highly similar to CoV sequences from the sample species and location in 2011, suggesting long-term maintenance in this population. A single Black-headed Gull represented the only positive sample from the order Charadriiformes. Globally, Anas species represent the largest fraction of avian CoV sequences, and there seems to be no host species, geographical or temporal structure. To better understand the eitiology, epidemiology and ecology of these viruses more systematic surveillance of wild birds and subsequent sequencing of detected CoV is imperative.
        url: https://doi.org/10.1371/journal.pone.0150198
        doi: 10.1371/journal.pone.0150198

         id: cord-280881-5o38ihe0
     author: Wlodawer, Alexander
      title: A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases
       date: 2003-11-11
      words: 4862.0
  sentences: 220.0
      pages: 
     flesch: 51.0
      cache: ./cache/cord-280881-5o38ihe0.txt
        txt: ./txt/cord-280881-5o38ihe0.txt
    summary: These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8] . We have now applied the tools of molecular homology modeling to predicting a structure of CLN2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors. Mammalian enzymes homologous to human CLN2 [2, 4] form a subfamily of sedolisins with highly conserved sequences ( Figure 1 ). Exploiting the sequence similarity between CLN2, sedolisin, and kumamolisin ( Figure 4 ), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human CLN2.
   abstract: BACKGROUND: Tripeptidyl-peptidase I, also known as CLN2, is a member of the family of sedolisins (serine-carboxyl peptidases). In humans, defects in expression of this enzyme lead to a fatal neurodegenerative disease, classical late-infantile neuronal ceroid lipofuscinosis. Similar enzymes have been found in the genomic sequences of several species, but neither systematic analyses of their distribution nor modeling of their structures have been previously attempted. RESULTS: We have analyzed the presence of orthologs of human CLN2 in the genomic sequences of a number of eukaryotic species. Enzymes with sequences sharing over 80% identity have been found in the genomes of macaque, mouse, rat, dog, and cow. Closely related, although clearly distinct, enzymes are present in fish (fugu and zebra), as well as in frogs (Xenopus tropicalis). A three-dimensional model of human CLN2 was built based mainly on the homology with Pseudomonas sp. 101 sedolisin. CONCLUSION: CLN2 is very highly conserved and widely distributed among higher organisms and may play an important role in their life cycles. The model presented here indicates a very open and accessible active site that is almost completely conserved among all known CLN2 enzymes. This result is somehow surprising for a tripeptidase where the presence of a more constrained binding pocket was anticipated. This structural model should be useful in the search for the physiological substrates of these enzymes and in the design of more specific inhibitors of CLN2.
        url: https://www.ncbi.nlm.nih.gov/pubmed/14609438/
        doi: 10.1186/1472-6807-3-8

         id: cord-018963-2lia97db
     author: Xu, Ying
      title: Protein Structure Prediction by Protein Threading
       date: 2010-04-29
      words: 15309.0
  sentences: 716.0
      pages: 
     flesch: 48.0
      cache: ./cache/cord-018963-2lia97db.txt
        txt: ./txt/cord-018963-2lia97db.txt
    summary: Their follow-up work (Elofsson et aI., 1996; Fischer and Eisenberg, 1996; Fischer et aI., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et aI., 1992) on protein fold recognition led to the development of a new brand ofpowerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many ofthe proteins encoded in the hundreds of genomes that have been sequenced up to now.
   abstract: The seminal work of Bowie, Lüthy, and Eisenberg (Bowie et al., 1991) on “the inverse protein folding problem” laid the foundation of protein structure prediction by protein threading. By using simple measures for fitness of different amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold. Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term “protein threading.” These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123984/
        doi: 10.1007/978-0-387-68825-1_1

         id: cord-010499-yefxrj30
     author: Yelverton, Elizabeth
      title: The function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in Escherichia coli
       date: 2006-10-27
      words: 5883.0
  sentences: 330.0
      pages: 
     flesch: 60.0
      cache: ./cache/cord-010499-yefxrj30.txt
        txt: ./txt/cord-010499-yefxrj30.txt
    summary: Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain ''hungry'' codons whose cognate aminoacyi-tRNAs are in short supply (Gallant and Foley, 1980; Weiss and Gailant, 1983; 1986; Gallant et ai, 1985; Kurland and Gallant, 1986) . Not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the rllB gene of phage T4, Weiss and Gallant (1986) found that oniy a minority were phenotypicaily suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context njies governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al., 1988; Gallant and Lindsiey, 1992; Peter et ai. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site.
   abstract: A 15‐17 nucleotide sequence from the gag‐pol ribosome frameshift site of HIV‐1 directs analogous ribosomal frameshifting in Escherichia coli. Limitation for leucine, which is encoded precisely at the frameshift site, dramatically increased the frequency of leftward frameshifting. Limitation for phenylaianine or arginine, which are encoded just before and just after the frameshift, did not significantly affect frameshifting. Protein sequence analysis demonstrated the occurrence of two closeiy related frameshift mechanisms. In the first, ribosomes appear to bind leucyl‐tRNA at the frameshift site and then slip leftward. This is the 'simultaneous slippage’mechanism. In the second, ribosomes appear to slip before binding amlnoacyl‐tRNA, and then bind phenylaianyl‐tRNA, which is encoded in the left‐shifted reading frame. This mechanism is identicai to the‘overlapping reading’we have demonstrated at other bacterial frameshift sites. The HIV‐1 sequence is prone to frame‐shifting by both mechanisms in E. coli.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7192232/
        doi: 10.1111/j.1365-2958.1994.tb00310.x

         id: cord-005060-n901y2d4
     author: ZHANG, Feiyun
      title: Complete Nucleotide Sequence of Ryegrass Mottle Virus : A New Species of the Genus Sobemovirus
       date: 2001
      words: 2602.0
  sentences: 173.0
      pages: 
     flesch: 62.0
      cache: ./cache/cord-005060-n901y2d4.txt
        txt: ./txt/cord-005060-n901y2d4.txt
    summary: The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The genome sequence of sobernoviruses has been determined in Southern bean mosaic virus (SBMV)''2,24), CfMV8315), Rice yellow mottle virus (RYMV)") and Lucerne transient streak virus (LTSV, accession number U31286). However, the con-served sequence, WAG + E/D rich sequence is detected in the region, and putative E/S cleavage sites are present on both sides of the region : proteolytic cleavage would result in a protein of 9 kDa. Possibly, the VPg of RGMoV is located between the protease and the RNA-dependent RNA polymerase domains in the same order as in the SBMV ORF 222) (Fig. 3) . In the RGMoV RNA sequence, no ORF corresponds to the second largest product of 68 kDa. The putative replicase of CfMV is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping ORFs having a coding capacity for 60.9 kDa and 56.3 kDa proteins7J8).
   abstract: The genome of Ryegrass mottle virus (RGMoV) comprises 4210 nucleotides. The genomic RNA contains four open reading frames (ORFs). The largest ORF 2 encodes a polyprotein of 947 amino acids (103.6 kDa), which codes for a serine protease and an RNA-dependent RNA polymerase. The viral coat protein is encoded on ORF 4 present at the 3′-proximal region. Other ORFs 1 and 3 encode the predicted 14.6 kDa and 19.8 kDa proteins of unknown function. The consensus signal for frameshifting, heptanucleotide UUUAAAC and a stem-loop structure just downstream is in front of the AUG codon of ORF 3. Analysis of the in vitro translation products of RGMoV RNA suggests that the 68 kDa protein may represent a fusion protein of ORF 2-ORF 3 produced by frameshifting. The protease region of the polyprotein and coat protein have a low similarity with that of the sobemoviruses (approximately 25% amino acid identity), while the RNA-dependent RNA polymerase region has particularly strong similarity (54 to 60% of more than 350 amino acid residues). The sequence similarities of RGMoV to the sobemoviruses, together with the characteristic genome organization indicate that RGMoV is a new species of the genus Sobemovirus.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7088213/
        doi: 10.1007/pl00012989

         id: cord-340907-j9i1wlak
     author: Zarai, Yoram
      title: Evolutionary selection against short nucleotide sequences in viruses and their related hosts
       date: 2020-04-27
      words: 8162.0
  sentences: 415.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-340907-j9i1wlak.txt
        txt: ./txt/cord-340907-j9i1wlak.txt
    summary: Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. Figure 3A and B depicts the average number of under-represented sequences of size m ¼ 3, 4, and 5 nucleotides, identified in few subsets of viruses in both the original and random variants of the virus. A sampling analysis that we performed (see Supplementary document, Section 2.8) suggests that the number of under-represented sequences identified in dsDNA viruses matches their genomic size, when compared with RNA viruses. To show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses.
   abstract: Viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, while evading its immune system. Understanding how viruses co-evolve with their hosts is a fundamental topic in molecular evolution and may also aid in developing novel viral based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. These sequences cannot be explained by the coding regions’ amino acid content, codon, and dinucleotide frequencies. We specifically show that short homooligonucleotide and palindromic sequences tend to be under-represented in many viruses probably due to their effect on gene expression regulation and the interaction with the host immune system. In addition, we show that more sequences tend to be under-represented in dsDNA viruses than in other viral groups. Finally, we demonstrate, based on in vitro and in vivo experiments, how under-represented sequences can be used to attenuated Zika virus strains.
        url: https://www.ncbi.nlm.nih.gov/pubmed/32339222/
        doi: 10.1093/dnares/dsaa008

         id: cord-266794-oyppubq5
     author: Zhang, Dachuan
      title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
       date: 2020-09-01
      words: 1003.0
  sentences: 75.0
      pages: 
     flesch: 48.0
      cache: ./cache/cord-266794-oyppubq5.txt
        txt: ./txt/cord-266794-oyppubq5.txt
    summary: title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020) , a genome browser (Ham, et al., 2012) , and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses. We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species.
   abstract: MOTIVATION: The 2019 novel coronavirus outbreak has significantly affected global health and society. Thus, predicting biological function from pathogen sequence is crucial and urgently needed. However, little work has been performed to identify viruses by the enzymes that they encode, and which are key to pathogen propagation. RESULTS: We built a comprehensive scientific resource, SARS2020, that integrates coronavirus-related research, genomic sequences, and results of anti-viral drug trials. In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. This data-driven sequence-based strategy will enable rapid identification of agents responsible for future epidemics. AVAILABILITY: SARS2020 is available at http://design.rxnfinder.org/sars2020/. SUPPLEMENTARY INFORMATION:
        url: https://www.ncbi.nlm.nih.gov/pubmed/32871007/
        doi: 10.1093/bioinformatics/btaa767

         id: cord-344782-ond1ziu5
     author: Zhang, Jing
      title: Identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered Bellinger River snapping turtle (Myuchelys georgesi)
       date: 2018-10-24
      words: 6003.0
  sentences: 280.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-344782-ond1ziu5.txt
        txt: ./txt/cord-344782-ond1ziu5.txt
    summary: Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Following the detection of the novel virus, in November 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists and samples collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. BRV, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral RNA were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral RNA in lesions in kidneys and eye tissue-two of the main affected organs.
   abstract: In mid-February 2015, a large number of deaths were observed in the sole extant population of an endangered species of freshwater snapping turtle, Myuchelys georgesi, in a coastal river in New South Wales, Australia. Mortalities continued for approximately 7 weeks and affected mostly adult animals. More than 400 dead or dying animals were observed and population surveys conducted after the outbreak had ceased indicated that only a very small proportion of the population had survived, severely threatening the viability of the wild population. At necropsy, animals were in poor body condition, had bilateral swollen eyelids and some animals had tan foci on the skin of the ventral thighs. Histological examination revealed peri-orbital, splenic and nephric inflammation and necrosis. A virus was isolated in cell culture from a range of tissues. Nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. Its closest relatives are nidoviruses that have recently been described in pythons and lizards, usually in association with respiratory disease. In contrast, in the affected turtles, the most significant pathological changes were in the kidneys. Real time PCR assays developed to detect this virus demonstrated very high virus loads in affected tissues. In situ hybridisation studies confirmed the presence of viral nucleic acid in tissues in association with pathological changes. Collectively these data suggest that this virus is the likely cause of the mortalities that now threaten the survival of this species. Bellinger River Virus is the name proposed for this new virus.
        url: https://doi.org/10.1371/journal.pone.0205209
        doi: 10.1371/journal.pone.0205209

         id: cord-193910-7p3f3znj
     author: Zhang, Xiangxie
      title: Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
       date: 2020-11-01
      words: 7724.0
  sentences: 436.0
      pages: 
     flesch: 59.0
      cache: ./cache/cord-193910-7p3f3znj.txt
        txt: ./txt/cord-193910-7p3f3znj.txt
    summary: In the experiments, the performances of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Finally, three state-of-the-art methods, namely a con-volutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table ( 8) containing mean accuracy and standard deviation over the ten folds of the cross-validation. It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken.
   abstract: The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids and combine both feature extraction methods with a multitude of machine learning algorithms. Four different data sets, each concerning viral diseases such as Covid-19, AIDS, Influenza, and Hepatitis C, are used for evaluating the different approaches. The results of the experiments show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method leads in general to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest Covid-19 dataset
        url: https://arxiv.org/pdf/2011.00485v1.pdf
        doi: nan

         id: cord-031957-df4luh5v
     author: dos Santos-Silva, Carlos André
      title: Plant Antimicrobial Peptides: State of the Art, In Silico Prediction and Perspectives in the Omics Era
       date: 2020-09-02
      words: nan
  sentences: nan
      pages: 
     flesch: nan
      cache: 
        txt: 
    summary: 
   abstract: Even before the perception or interaction with pathogens, plants rely on constitutively guardian molecules, often specific to tissue or stage, with further expression after contact with the pathogen. These guardians include small molecules as antimicrobial peptides (AMPs), generally cysteine-rich, functioning to prevent pathogen establishment. Some of these AMPs are shared among eukaryotes (eg, defensins and cyclotides), others are plant specific (eg, snakins), while some are specific to certain plant families (such as heveins). When compared with other organisms, plants tend to present a higher amount of AMP isoforms due to gene duplications or polyploidy, an occurrence possibly also associated with the sessile habit of plants, which prevents them from evading biotic and environmental stresses. Therefore, plants arise as a rich resource for new AMPs. As these molecules are difficult to retrieve from databases using simple sequence alignments, a description of their characteristics and in silico (bioinformatics) approaches used to retrieve them is provided, considering resources and databases available. The possibilities and applications based on tools versus database approaches are considerable and have been so far underestimated.
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7476358/
        doi: 10.1177/1177932220952739

         id: cord-001835-0s7ok4uw
     author: nan
      title: Abstracts of the 29th Annual Symposium of The Protein Society
       date: 2015-10-01
      words: 138514.0
  sentences: 6150.0
      pages: 
     flesch: 40.0
      cache: ./cache/cord-001835-0s7ok4uw.txt
        txt: ./txt/cord-001835-0s7ok4uw.txt
    summary: Altogether, these results indicate that, although PHDs might be more selective for HIF as a substrate as it was initially thought, the enzymatic activity of the prolyl hydroxylases is possibly influenced by a number of other proteins that can directly bind to PHDs. Non-natural aminoacids via the MIO-enzyme toolkit Alina Filip 1 , Judith H Bartha-V ari 1 , Gergely B an oczy 2 , L aszl o Poppe 2 , Csaba Paizs 1 , Florin-Dan Irimie 1 1 Biocatalysis and Biotransformation Research Group, Department of Chemistry, UBB, 2 Department of Organic Chemistry and Technology An attractive enzymatic route to enantiomerically pure to the highly valuable a-or b-aromatic amino acids involves the use of aromatic ammonia lyases (ALs) and aminomutases (AMs). Continuing our studies of the effect of like-charged residues on protein-folding mechanisms, in this work, we investigated, by means of NMR spectroscopy and molecular-dynamics simulations, two short fragments of the human Pin1 WW domain [hPin1(14-24); hPin1(15-23)] and one single point mutation system derived from hPin1(14-24) in which the original charged residues were replaced with non-polar alanine residues.
   abstract: nan
        url: https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/pro.2823
        doi: 10.1002/pro.2823

         id: cord-004879-pgyzluwp
     author: nan
      title: Programmed cell death
       date: 1994
      words: 81677.0
  sentences: 4465.0
      pages: 
     flesch: 51.0
      cache: ./cache/cord-004879-pgyzluwp.txt
        txt: ./txt/cord-004879-pgyzluwp.txt
    summary: Furthermore kinetic experiments after complementation of HIV=RT p66 with KIV-RT pSl indicated that HIV-RT pSl can restore rate and extent of strand displacement activity by HIV-RT p66 compared to the HIV-RT heterodimer D66/D51, suggesting a function of the 51 kDa polypeptide, The mouse mammary tumor virus proviral DNA contains an open reading frame in the 3'' long terminal repeat which can code for a 36 kDa polypeptide with a putative transmembrane sequence and five N-linked glycosylation sites. To this end we used constructs encoding the c-fos (and c-jun) genes fused to the hormone-binding domain of the human estrogen receptor, designated c-FosER (and c-JunER), We could show that short-term activation (30 mins.) of c-FosER by estradiole (E2) led to the disruption of epithelial cell polarity within 24 hours, as characterized by the expression of apical and basolateral marker proteins.
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087532/
        doi: 10.1007/bf02033112

         id: cord-014462-11ggaqf1
     author: nan
      title: Abstracts of the Papers Presented in the XIX National Conference of Indian Virological Society, “Recent Trends in Viral Disease Problems and Management”, on 18–20 March, 2010, at S.V. University, Tirupati, Andhra Pradesh
       date: 2011-04-21
      words: 35453.0
  sentences: 1711.0
      pages: 
     flesch: 49.0
      cache: ./cache/cord-014462-11ggaqf1.txt
        txt: ./txt/cord-014462-11ggaqf1.txt
    summary: Molecular diagnosis based on reverse transcription (RT)-PCR s.a. one step or nested PCR, nucleic acid sequence based amplification (NASBA), or real time RT-PCR, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. Non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. The results of this study indicate that NS1 antigen based ELISA test can be an useful tool to detect the dengue virus infection in patients during the early acute phase of disease since appearance of IgM antibodies usually occur after fifth day of the infection. The studies showed high level of expression in case of constructed vector as compared to infected virus for the specific protein.
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639731/
        doi: 10.1007/s13337-011-0027-2

         id: cord-014674-ey29970v
     author: nan
      title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002
       date: 2003
      words: 2522.0
  sentences: 181.0
      pages: 
     flesch: 62.0
      cache: ./cache/cord-014674-ey29970v.txt
        txt: ./txt/cord-014674-ey29970v.txt
    summary: title: Dreizehnter Bericht nach Inkrafttreten des Gentechnikgesetzes (GenTG) für den Zeitraum vom 1.1.2002 bis 31.12.2002 : Die Arbeit der Zentralen Kommission für die Biologische Sicherheit (ZKBS) im Jahr 2002 We have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report.We find that aside from problematic details of the experimental design and some erratic presentations of the data the results of the study do not provide evidence for the introgression of recombinant DNA from transgenic crop plants into the genomes of ''criollo'' maize. 3. We characterized with the help of BLAST searches those parts of the sequences of the iPCR amplification products that were denoted by Quist and Chapela in their Fig.2 as regions flanking the CMV p-35S sequence.We find that the sequence of AF434754 denoted adh1 in the K1 source of Fig. 2 does not match with the maize adh1 gene. We examined whether the identified regions in the maize genomic DNA from which PCR amplification products were obtained by the authors would perhaps be flanked by primer binding sites.
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7079883/
        doi: 10.1007/s00103-003-0614-5

         id: cord-023208-w99gc5nx
     author: nan
      title: Poster Presentation Abstracts
       date: 2006-09-01
      words: 70854.0
  sentences: 3492.0
      pages: 
     flesch: 43.0
      cache: ./cache/cord-023208-w99gc5nx.txt
        txt: ./txt/cord-023208-w99gc5nx.txt
    summary: In order to develop a synthetic protocol by an automated instrumentation, increasing yield, purity of the crude, and reaction time, a microwave-assisted solid phase peptide synthesis was validated comparing the use of the new generation of Triazine-Based Coupling Reagents (TBCRs) with a series of commonly used ones. Ubiquitinium is a well known mechanism in protein degredation of Eukaryotic cells ,in which many obsolte and corrupted three dimentional structure protein ,become marked by covalent attachment of ubuquitin through a multi-step enzymatic pathway.Ubiquitin is a small ,8.5 kDa peptide of 76 amino acid residues that targets such substrtes for proteolysis in proteasome .Recnt studies showed that an extra cellular ubiquitination process also taking place in the epididymes of humans and other animals marks protein on the surface of the defective sperm .it appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. This head-to-tailcyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (Lys5) present in the P1 position, which is responsible for inhibitor specificity.As was reported by us and other groups, SFTI-1 analogues with one cycle only retain trypsin inhibitory activity.
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167816/
        doi: 10.1002/psc.797

         id: cord-023209-un2ysc2v
     author: nan
      title: Poster Presentations
       date: 2008-10-07
      words: 111878.0
  sentences: 5398.0
      pages: 
     flesch: 45.0
      cache: ./cache/cord-023209-un2ysc2v.txt
        txt: ./txt/cord-023209-un2ysc2v.txt
    summary: Site-specifi c PEGylation of human IgG1-Fab using a rationally designed trypsin variant In the present contribution we report on a novel, highly selective biocatalytic method enabling C-terminal modifi cations of proteins with artifi cial functionalities under native state conditions. Recently, our group report a novel approach to a totally synthetic vaccine which consists of FMDV (Foot and Mouth Disease Virus) VP1 peptides, prepared by covalent conjugation of peptide biomolecules with membrane active carbochain polyelectrolytes In the present study, peptide epitops of VP1 protein both 135-161(P1) amino acid residues (Ser-Lys-Tyr-Ser-Thr-Thr-Gly-Glu-Arg-Thr-Arg-Thr-Arg-Gly-Asp-Leu-Gly-Ala-Leu-Ala-Ala-Arg-Val-Ala-Thr-Gln-Leu-Pro-Ala) and triptophan (Trp) containing on the N terminus 135-161 amino acid residues (Trp-135-161) (P2) were synthesized by using the microwave assisted solid-phase methods. Using as a template a peptide, already identifi ed, with agonist activity against PTPRJ(H-[Cys-His-His-Asn-Leu-Thr-His-Ala-Cys]-OH), here we report a structure-activity study carried out through endocyclic modifi cations (Ala-scan, D-substitutions, single residue deletions, substitutions of the disulfi de bridge) and the preliminary biological results of this set of compounds.
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7167823/
        doi: 10.1002/psc.1090

         id: cord-023647-dlqs8ay9
     author: nan
      title: Sequences and topology
       date: 2003-03-21
      words: 4505.0
  sentences: 747.0
      pages: 
     flesch: 69.0
      cache: ./cache/cord-023647-dlqs8ay9.txt
        txt: ./txt/cord-023647-dlqs8ay9.txt
    summary: Nucleotide Sequence Analysis of the L G~ne of Vesicular Stomafltia Virus (New Jersey Serotype) --Identification of Conserved Domai~L~ in L Proteins of Nonsegmented Negative-Strand RNA Viruses DERSE I~ Equine Infectious Anemia Virus tat--Insights into the Structure, Function, and Evolution of Lentivtrus tran.~Activator Proteins Ho~tu~ ~ s71 is a Ehylngcueticellly Distinct Human Endogenous Reteovtgal 1Rlement with Structural mad Sequence Homology to Simian Sarcoma Virus (SSV). Distinct Fercedoxins from Rhodobacter-Capsulstus -Complete Amino Acid Sequences and Molecular Evolution Complete Amino Acid Sequence and Homologies of Human Erythrocyte Membrane Protein Band 4.2. Identification of Two Highly Conserved Amino Acid Sequences Amon~ the ~x-subunits and Molecular ~ The Predicted Amino Acid Sequence of ct-lnternexin is that of a novel Neuronal lntegmedla~ ~ent Protein Inttaspecific Evolution of a Gene Family Coding for Urinary Proteins Attalysi~ of CDNA for Human ~ AJudgyrin I~dicltes a Repeated Structure with Homology to Tissue-Differentiation a~td Cell-Cycle Control Protein
   abstract: nan
        url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7173161/
        doi: 10.1016/0959-440x(91)90051-t

         id: cord-300796-rmjv56ia
     author: nan
      title: The signal sequence of the p62 protein of Semliki Forest virus is involved in initiation but not in completing chain translocation
       date: 1990-09-01
      words: 8031.0
  sentences: 405.0
      pages: 
     flesch: 57.0
      cache: ./cache/cord-300796-rmjv56ia.txt
        txt: ./txt/cord-300796-rmjv56ia.txt
    summary: In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process. Furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at Asn~3 of the p62 sequence if the 40 residues long NH2-terminal p62 peptide carries a signal sequence. This must involve Asn~3 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (Garoff et al., 1980 ; references on dhfr sequence in legend to Fig. 1) , Finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain.
   abstract: So far it has been demonstrated that the signal sequence of proteins which are made at the ER functions both at the level of protein targeting to the ER and in initiation of chain translocation across the ER membrane. However, its possible role in completing the process of chain transfer (see Singer, S. J., P. A. Maher, and M. P. Yaffe. Proc. Natl. Acad. Sci. USA. 1987. 84:1015-1019) has remained elusive. In this work we show that the p62 protein of Semliki Forest virus contains an uncleaved signal sequence at its NH2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. As the glycosylation of the signal sequence most likely occurs after its release from the ER membrane our results suggest that this region has no role in completing the transfer process.
        url: https://www.ncbi.nlm.nih.gov/pubmed/2391367/
        doi: nan

         id: cord-256608-ajzk86rq
     author: van Weezep, Erik
      title: PCR diagnostics: In silico validation by an automated tool using freely available software programs
       date: 2019-05-13
      words: 4950.0
  sentences: 258.0
      pages: 
     flesch: 54.0
      cache: ./cache/cord-256608-ajzk86rq.txt
        txt: ./txt/cord-256608-ajzk86rq.txt
    summary: An alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the PCR test as search queries and the program SSEARCH available in the FASTA sequence analysis package (Brenner et al., 1998; Pearson, 1991; Pearson et al., 2017; . The in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe as these are considered to be detected with the respective PCR test. To demonstrate the suitability of our in-house developed software tool PCRv, we determined the in silico sensitivity and specificity of three PCR tests for West Nile virus (WNV) recommended by the World Organisation for Animal Health (OIE) (Eiden et al., 2010; Johnson et al., 2001) .
   abstract: PCR diagnostics are often the first line of laboratory diagnostics and are regularly designed to either differentiate between or detect all pathogen variants of a family, genus or species. The ideal PCR test detects all variants of the target pathogen, including newly discovered and emerging variants, while closely related pathogens and their variants should not be detected. This is challenging as pathogens show a high degree of genetic variation due to genetic drift, adaptation and evolution. Therefore, frequent re-evaluation of PCR diagnostics is needed to monitor its usefulness. Validation of PCR diagnostics recognizes three stages, in silico, in vitro and in vivo validation. In vitro and in vivo testing are usually costly, labour intensive and imply a risk of handling dangerous pathogens. In silico validation reduces this burden. In silico validation checks primers and probes by comparing their sequences with available nucleotide sequences. In recent years the amount of available sequences has dramatically increased by high throughput and deep sequencing projects. This makes in silico validation more informative, but also more computing intensive. To facilitate validation of PCR tests, a software tool named PCRv was developed. PCRv consists of a user friendly graphical user interface and coordinates the use of the software programs ClustalW and SSEARCH in order to perform in silico validation of PCR tests of different formats. Use of internal control sequences makes the analysis compliant to laboratory quality control systems. Finally, PCRv generates a validation report that includes an overview as well as a list of detailed results. In-house developed, published and OIE-recommended PCR tests were easily (re-) evaluated by use of PCRv. To demonstrate the power of PCRv, in silico validation of several PCR tests are shown and discussed.
        url: https://doi.org/10.1016/j.jviromet.2019.05.002
        doi: 10.1016/j.jviromet.2019.05.002

==== make-pages.sh questions [ERIC WAS HERE]
==== make-pages.sh search
/data-disk/reader-compute/reader-cord/bin/make-pages.sh: line 77: /data-disk/reader-compute/reader-cord/tmp/search.htm: No such file or directory
Traceback (most recent call last):
  File "/data-disk/reader-compute/reader-cord/bin/tsv2htm-search.py", line 51, in <module>
    with open( TEMPLATE, 'r' ) as handle : htm = handle.read()
FileNotFoundError: [Errno 2] No such file or directory: '/data-disk/reader-compute/reader-cord/tmp/search.htm'
==== make-pages.sh topic modeling corpus
Zipping study carrel