key: cord-0740564-sni8pnoc
authors: De Silva, Nishadi H.; Bhai, Jyothish; Chakiachvili, Marc; Contreras-Moreira, Bruno; Cummins, Carla; Frankish, Adam; Gall, Astrid; Genez, Thiago; Howe, Kevin L.; Hunt, Sarah E.; Martin, Fergal J.; Moore, Benjamin; Ogeh, Denye; Parker, Anne; Parton, Andrew; Ruffier, Magali; Sakthivel, Manoj Pandian; Sheppard, Dan; Tate, John; Thormann, Anja; Thybert, David; Trevanion, Stephen J.; Winterbottom, Andrea; Zerbino, Daniel R.; Finn, Robert D.; Flicek, Paul; Yates, Andrew D.
title: The Ensembl COVID-19 resource: Ongoing integration of public SARS-CoV-2 data
date: 2021-03-29
journal: bioRxiv
DOI: 10.1101/2020.12.18.422865
sha: 90202c6794fe4bc791dcdcc1216579140302f055
doc_id: 740564
cord_uid: sni8pnoc

The COVID-19 pandemic has seen unprecedented use of SARS-CoV-2 genome sequencing for epidemiological tracking and identification of emerging variants. Understanding the potential impact of these variants on the infectivity of the virus and the efficacy of emerging therapeutics and vaccines has become a cornerstone of the fight against the disease. To support the maximal use of genomic information for SARS-CoV-2 research, we launched the Ensembl COVID-19 browser, incorporating a new Ensembl gene set, multiple variant sets (including novel variation calls), and annotation from several relevant resources integrated into the reference SARS-CoV-2 assembly. This work included key adaptations of existing Ensembl genome annotation methods to model ribosomal slippage, stringent filters to elucidate the highest confidence variants and utilisation of our comparative genomics pipelines on viruses for the first time. Since May 2020, the content has been regularly updated and tools such as the Ensembl Variant Effect Predictor have been integrated. The Ensembl COVID-19 browser is freely available at https://covid-19.ensembl.org.

Over the past twenty years, multiple zoonotic respiratory diseases caused by coronaviruses have [...]

[...] Archive (ENA) into an Ensembl database schema with minor modifications to software regularly used to integrate assemblies from the ENA into Ensembl.

To enable the correct annotation of SARS-CoV-2, the Ensembl gene annotation methods [6] were adapted to reflect the biology of the virus. To identify protein coding genes, we aligned SARS-CoV-2 proteins from RefSeq [7] to the genome using Exonerate [8]. A challenge for annotation is that the first and [...]

Our annotation approach was tested on 90 additional SARS-CoV-2 assemblies retrieved from the ENA. We assessed alignment coverage and percentage identity of the resultant gene translations to verify accuracy and consistency. In all cases, full length alignments were observed, and the average amino acid percentage identity across all genes was 99.9% or 100% in most assemblies (one assembly had 99.81% identity). These results demonstrate that our annotation approach is able to scale consistently to larger volumes of viral data.

In addition to generating a fully integrated Ensembl gene annotation, we also imported the gene set [...]

[...] subunits: S1, which binds to the host cell receptor angiotensin-converting enzyme 2 (ACE2), and S2, which is involved in membrane fusion. The region of the S ORF encoding the S2 subunit of the spike protein clearly displays a high alignment coverage, while the region encoding the S1 subunit has large portions that are shared only by one other related genome. This demonstrates the dramatic difference in conservation between the S1 and S2 subunits.

Additionally, we applied our gene tree method [12] to group the protein coding genes into families and to predict orthologous and paralogous relationships between genes.
These results will be incorporated into the COVID-19 resource in Q2 2021.

Finally, we provide tracks to visualise problematic and caution sites, which result from common systematic errors associated with laboratory protocols and have been observed in submitted sequences [16]. Inclusion of these sites can adversely influence phylogenetic and evolutionary inference. Visualising them in the browser alongside the locations of primers and other community-derived annotations helps determine how best to proceed with analyses of each of these sites.

We have engaged our existing and new user communities using our blog and social media accounts to announce the release and updates to the Ensembl COVID-19 resource. We also highlighted the changes made to our gene annotation method to ensure the complete set of ORFs because these [...]

[...] alignment pipelines have been applied to the viral data with only minimal changes to parameters. We will continue to regularly update the site as new data emerges to support research into understanding the genomic evolution of this virus, identifying hotspots of genomic variation and enabling the rational design of future therapeutics, vaccines and policies well beyond the end of the current pandemic.

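As an illustration of how the problematic and caution sites described above might inform a downstream analysis, the sketch below partitions variant positions according to a site list. It is a minimal sketch under stated assumptions, not the code behind the browser tracks: the `SiteFlag` labels, the `partition_variants` function and the example positions are all hypothetical, and a real analysis would read the published site list rather than a hard-coded dictionary.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class SiteFlag(Enum):
    """Hypothetical labels mirroring the two kinds of site named in the text."""
    MASK = "mask"        # problematic site: exclude from inference
    CAUTION = "caution"  # caution site: keep, but review manually


def partition_variants(
    variant_positions: Iterable[int],
    flagged_sites: Dict[int, SiteFlag],
) -> Tuple[List[int], List[int], List[int]]:
    """Split 1-based genome positions into (kept, caution, masked) lists."""
    kept: List[int] = []
    caution: List[int] = []
    masked: List[int] = []
    for pos in variant_positions:
        flag = flagged_sites.get(pos)
        if flag is SiteFlag.MASK:
            masked.append(pos)
        elif flag is SiteFlag.CAUTION:
            caution.append(pos)
        else:
            kept.append(pos)
    return kept, caution, masked


# Illustrative positions only; they are not taken from any real site list.
example_sites = {100: SiteFlag.MASK, 2500: SiteFlag.CAUTION}
kept, caution, masked = partition_variants([100, 1200, 2500, 9000], example_sites)
print(kept, caution, masked)
```

Keeping caution sites in a separate list, rather than discarding them together with masked sites, leaves the decision on how to treat each one to the analyst, which is the kind of site-by-site judgement the tracks are meant to support.
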
Genome Composition and Divergence of the Novel Coronavirus
Ensembl Genomes 2020-enabling non-vertebrate genomic research
A new coronavirus associated with human respiratory disease in China
The UCSC SARS-CoV-2 Genome Browser
The Ensembl gene annotation system
Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
Automated generation of heuristics for biological sequence comparison
Emerging coronaviruses: Genome structure, replication, and pathogenesis
SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes
Progressive Cactus is a multiple-genome aligner for the thousand-genome era
Ensembl comparative genomics resources
The Ensembl Variant Effect Predictor
Nextstrain: real-time tracking of pathogen evolution
LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets
Issues with SARS-CoV-2 sequencing data
Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity
The newly introduced SARS-CoV-2 variant A222V is rapidly spreading in Lazio region
The circulating SARS-CoV-2 spike variant N439K maintains fitness while evading antibody-mediated immunity
Rfam 14: expanded coverage of metagenomic, viral and microRNA families
A Phylogenetically Conserved Hairpin-Type 3′ Untranslated Region Pseudoknot Functions in Coronavirus RNA Replication
UniProt: the universal protein knowledgebase in 2021
Evolutionary and structural analyses of SARS-CoV-2 D614G spike protein mutation now documented worldwide

We would like to thank the following people at the EMBL-EBI for their contributions to our resource and thoughtful discussions: Nick Goldman, Zamin Iqbal [...] Luca Da Rin Fioretto, Thomas Maurel and Vinay Kaikala [...]

The selection of our code to convert CSV files into BigBed files is at https://github.com/Ensembl/sarscov2-annotation. The code relevant to processing SARS-CoV-2 variants in Ensembl is at https://github.com/Ensembl/ensembl-variation, the gene annotation pipeline is available at https://github.com/Ensembl/ensembl-annotation and the code used for comparative analysis is at https://github.com/Ensembl/ensembl-compara.
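
BigBed files are typically built from sorted BED text, so a CSV-to-BigBed conversion such as the one mentioned above usually includes an intermediate step like the one sketched here. This is an illustration under stated assumptions rather than the code in the repository: the column names (`region`, `start`, `end`, `name`), the function `csv_to_bed_lines`, the placeholder sequence name `seq1` and the choice of 1-based inclusive input coordinates are all hypothetical. The resulting BED text could then be passed to a converter such as the UCSC `bedToBigBed` utility.

```python
import csv
import io
from typing import List


def csv_to_bed_lines(csv_text: str) -> List[str]:
    """Convert CSV rows (hypothetical columns: region,start,end,name) to BED lines.

    Input coordinates are assumed to be 1-based and inclusive. BED uses
    0-based, half-open intervals, so only the start is shifted by one.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = int(row["start"]) - 1  # 1-based inclusive -> 0-based start
        end = int(row["end"])          # inclusive end equals half-open end
        rows.append((row["region"], start, end, row["name"]))
    # BigBed converters expect input sorted by sequence name, then by start.
    rows.sort(key=lambda r: (r[0], r[1]))
    return [f"{region}\t{start}\t{end}\t{name}" for region, start, end, name in rows]


# Placeholder sequence name and feature names, for illustration only.
example_csv = (
    "region,start,end,name\n"
    "seq1,300,320,primer_B\n"
    "seq1,30,54,primer_A\n"
)
for line in csv_to_bed_lines(example_csv):
    print(line)
```

The coordinate shift is the step most easily overlooked: a feature recorded as positions 30 to 54 in a 1-based table becomes the interval 29 to 54 in BED.
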