key: cord-0865425-9vuoc1if authors: Andreotti, Sandro; Altmüller, Janine; Quedenau, Claudia; Borodina, Tatiana; Nouailles, Geraldine; Teixeira Alves, Luiz Gustavo; Landthaler, Markus; Bieniara, Maximilian; Trimpert, Jakob; Wyler, Emanuel title: De Novo Whole Genome Assembly of the Roborovski Dwarf Hamster (Phodopus roborovskii) Genome, an Animal Model for Severe/Critical COVID-19 date: 2021-10-09 journal: bioRxiv DOI: 10.1101/2021.10.02.462569 sha: 0caace437af68b11bc15558b906b52bde401e996 doc_id: 865425 cord_uid: 9vuoc1if

The Roborovski dwarf hamster Phodopus roborovskii belongs to the Phodopus genus, one of seven within the Cricetinae subfamily. Like other small mammals such as mice, rats or ferrets, hamsters can be important animal models for a range of diseases. Whereas the Syrian hamster from the genus Mesocricetus is now widely used as a model for mild to moderate COVID-19, Roborovski dwarf hamsters show a severe to lethal course of disease upon infection with the novel human coronavirus SARS-CoV-2.

The ongoing pandemic caused by the human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has made clear that traditional animal models such as mice and rats are not always suitable for studying novel diseases and that, moreover, multiple animal models might be required to adequately reflect the variety of possible disease manifestations (Muñoz-Fontela et al. 2020). In fact, the importance of host factors has become strikingly evident in coronavirus disease 2019 (COVID-19), as the same virus causes disease severities that span from asymptomatic infections to severe acute respiratory distress syndrome (ARDS) and fatal multi-organ dysfunction (Guan, NEJM 2020). To resolve immune mechanisms, identify putative targets for intervention and test novel therapies and vaccination regimens, animal models, ideally small animal models that reflect all presentations of COVID-19, are required (Veenhuis & Zeiss 2021; Lee & Lowen 2021).
Non-transgenic mice and rats could not be productively infected with wild-type SARS-CoV-2 and consequently showed no weight loss or lung pathology (Muñoz-Fontela et al. 2020; Bao et al. 2020; Dinnon et al. 2020; Gu et al. 2020; Hassan et al. 2020; Sun et al. 2020). In order to identify suitable models, species studied in the context of COVID-19 were chosen based on the similarity of their SARS-CoV-2 receptor, angiotensin converting enzyme-2 (ACE-2), as predicted in silico (Devaux et al. 2021; Wu et al. 2020; Pach et al. 2020). COVID-19 research has employed non-human primates (Deng et al. 2020; Lu et al. 2020; Yu et al. 2020), cats (Shi et al. 2020), ferrets (Kim et al. 2020; Richard et al. 2020) and hamsters (Bertzbach et al. 2021) as animal models. Most small animals introduced to date, including the Syrian hamster, show mild to moderate disease symptoms with resolving infections (Chan et al. 2020; Imai et al. 2020; Kreye et al. 2020; Sia et al. 2020; Osterrieder et al. 2020). We previously introduced a dwarf hamster species, Phodopus roborovskii, as a representative model for a severe course of disease including systemic immune activation and fatal disease outcome (Trimpert et al. 2020; Zhai et al. 2021). In our study, despite very similar ACE-2 sequences amongst the three analyzed Phodopus species, Roborovski dwarf hamsters (P. roborovskii), Campbell's dwarf hamsters (P. campbelli) and Djungarian hamsters (P. sungorus), only the Roborovski dwarf hamster showed a severe disease manifestation following intranasal SARS-CoV-2 infection.

The natural habitat of P. roborovskii is the sandy deserts of northern China, where the animals mainly eat seeds and insects. Within groups, amicable interactions are slightly more frequent than aggressive ones, in line with the overall more social behavior of Phodopus compared to, e.g., Mesocricetus (golden hamster) species (Wilson et al. 2009).
The Roborovski dwarf hamster is, so far, the only non-transgenic animal that consistently develops severe disease and hyper-inflammation of the lung following infection with SARS-CoV-2 (Gantier 2021; Muñoz-Fontela et al. 2020). Clinical signs develop within the first 48 hours following infection and include a drastic reduction in body temperature, substantial weight loss, forced breathing, ruffled fur and lethargy. By histopathology, massive alveolar destruction and microthrombosis are evident in the lungs of infected animals, while other organs, including the brain, do not seem to be primarily involved in disease development. The rapid onset and fulminant course of pulmonary disease make this species a valuable model to study severe courses of COVID-19 in humans and to test therapies and vaccinations against the background of severe disease (Muñoz-Fontela et al. 2020; Gantier 2021). The novelty of this model, however, entails a problematic lack of reagents and tools to study immune reactions and other host factors. The absence of classical tools for molecular biology makes transcriptome and proteome analyses all the more important, as they may help to understand the molecular reasons for severe COVID-19 and could supply information that helps to find reasonable medical interventions. A prerequisite for genomics and proteomics studies is a thoroughly annotated, publicly available genome. Since the closest annotated and available genome comes from a species in a different genus (Mesocricetus auratus, MesAur1.0), we describe here a scaffold-level genome assembly based on long and short read DNA sequencing, annotated using RNA-sequencing data from heart and lung of P. roborovskii animals.

Isolation and sequencing of genomic DNA

Genomic DNA was isolated from a whole blood sample of an animal of about seven weeks of age. From the same DNA sample, libraries were prepared for PromethION long read sequencing and Illumina short read sequencing.
The final assembly comprises a total of 2,078 contigs (2,055 of them longer than 50 kb) with a total length of 2.38 Gb, an N50 of 25.78 Mb and an L50 of 30 (Supplementary Table S1). According to QUAST, 99.75% of 676.47 M paired-end short reads and 99.74% of 4.13 M long reads were mapped, yielding average read depths of 80 and 34, respectively. The positive effect of genome assembly polishing using the described toolchain was confirmed by the genome completeness analysis with BUSCO. While the raw assembly as produced by Canu has a completeness of 79.3%, this value improved to 85.9% with Racon and 88.8% with Medaka, and finally reached 92.0% after short read polishing with POLCA (Supplementary Table S2). In the final step of the analysis, the screening with Kraken2, three contigs remained unclassified, three were classified as bacteria (total length: 314.68 kb), 74 as human, and the remaining 2,028 matched either the golden hamster or mouse. Of the 74 contigs classified as human, 47 passed the final BLAST check and were included in the final cleaned assembly composed of 2,078 contigs. Before quality and adapter trimming and filtering, the four RNA-Seq samples had between 13.5 and 31.2 million reads, of which 96.4% to 98.1% passed the preprocessing stage and between 71.6% and 87.3% were uniquely mapped to the assembly, with multi-mapping rates between 9.1% and 18.3% (Supplementary Table S3). The final cleaned and curated annotation based on the prediction with the GeMoMa pipeline comprises 22,139 predicted transcripts in a total of 18,029 annotated gene loci.

Ethics statement on animal husbandry

Roborovski dwarf hamsters were obtained through the German pet trade and housed in IVC units (Tecniplast). Hamsters were provided ad libitum with food and water and supplied with various enrichment materials (Carfil). For DNA extraction and sequencing, whole blood was obtained from uninfected control animals of a SARS-CoV-2 infection trial (Trimpert et al.
2020) that was performed according to all applicable regulations and approved by the relevant state authority (Landesamt für Gesundheit und Soziales, Berlin, Approval Number 0086/20). RNA was extracted from SARS-CoV-2 infected and non-infected hamsters from an independent experiment (Trimpert et al. 2021) under the same permit. Briefly, anaesthetized hamsters were infected with 1 × 10^5 focus forming units of SARS-CoV-2 (variant B.1, strain SARS-CoV-2/München-1.1/2020/929) in 20 µL cell culture medium. Animals were euthanized for sample collection on days 2 and 3 post infection as previously described (Trimpert et al. 2020). Following the 3R principle, all material used for this study was obtained from animals subject to independent animal experiments; no additional animals were used.

Isolation of genomic DNA

100 µL of previously frozen whole blood was lysed by addition of 400 µL lysis solution CBV (Analytik Jena) and 10 µL proteinase K (20 mg/mL, Analytik Jena), followed by incubation for 10 minutes at 70 °C. Following this lysis step, another 10 µL proteinase K was added to perform an extended protein digestion for 30 minutes at 50 °C. DNA was extracted using a standard phenol/chloroform extraction. In the first step, 1 mL liquefied TE-saturated phenol (Carl Roth) was added, followed by gentle mixing by inverting the tube 20 times and centrifugation at 10,000 × g for 10 minutes. The aqueous phase was aspirated with a cut pipette tip, combined with 1 mL phenol/chloroform/isoamyl alcohol (25:24:1, Carl Roth), and mixed and centrifuged again as stated above. Again, the aqueous supernatant was carefully removed, mixed with 1 mL chloroform (Merck) and centrifuged for phase separation. The remaining aqueous phase was mixed with 1 mL absolute ethanol (Merck) and centrifuged for 30 minutes at 15,000 × g for DNA precipitation. All steps were carried out with cut pipette tips and very gentle mixing to avoid shearing of the DNA.
Lung pieces were stored in RNAlater (Thermo Fisher) for about 4 hours before extraction. Afterwards, the tissue was lysed in a homogenizer (Eppendorf) in Trizol (Thermo Fisher). For extraction of total RNA from whole blood, 250 µL of anticoagulated (EDTA) sample was lysed by addition of 750 µL Trizol LS reagent (Thermo Fisher). RNA was purified from Trizol using the Direct-zol RNA mini kit (Zymo Research) according to the manufacturer's instructions.

For short read DNA sequencing, 1 µg of DNA was sonicated (Bioruptor, Diagenode) and the Illumina TruSeq DNA nano kit was applied, using a slightly modified protocol with only one cycle of PCR to complete adapter structures. Following library validation and quantification (Agilent TapeStation, Peqlab KAPA Library Quantification Kit and the Applied Biosystems 7900HT Sequence Detection System), sequencing was performed on an Illumina NovaSeq 6000 instrument with 2x150 paired-end sequencing.

Sequencing libraries for long-read sequencing were prepared from 2.5 µg of unsheared genomic DNA, following the protocol of Oxford Nanopore's LSK109 kit (ONT, Oxford) [https://store.nanoporetech.com/eu/media/wysiwyg/pdfs/SQK-LSK109/Genomic_DNA_by_Ligation_SQK-LSK109_-minion.pdf]. DNA was end-repaired, A-tailed and purified (1x Ampure XP beads, Beckman Coulter). Then the sequencing adapter with attached motor protein was ligated and the DNA was purified with 0.4x Ampure beads. Quality and quantity of libraries were checked using the HS gDNA 50 kb kit on a Fragment Analyzer (Agilent) and the dsDNA Qubit assay (Thermo Fisher). Libraries were loaded three times, once per 24 h (30 fmol, 30 fmol and 14 fmol), i.e. 74 fmol in total. The complete runtime was 72 hours.

Poly(A)+ sequencing libraries were generated from total RNA using the NEBNext Ultra II Directional RNA Library Prep Kit (New England Biolabs) according to the manufacturer's instructions, and sequenced on a NextSeq 500 device with single-end 76 cycles.
De novo genome assembly

Raw unprocessed reads were assembled using the Canu assembler (Koren et al. 2017) with an estimated genome size of 2.1 Gb and default parameters. This initial assembly was improved in a multi-step procedure. As a first step, we mapped raw long reads using minimap2 (Li 2018) and applied the Racon (Vaser et al. 2017) assembly polishing tool. In the next step, the Racon-polished genome was further improved with Medaka (https://nanoporetech.github.io/medaka), again using the raw long reads, this time mapped with the mini_align script provided by the Medaka package, which also uses minimap2. For performance reasons, we followed a recommendation in Medaka's documentation and first split the contigs into ten sets of almost equal size. Each set was processed using the subprogram medaka consensus, and the results were finally merged with medaka stitch. As a last polishing step, the result of Medaka was polished using the genomic short reads with POLCA (Zimin & Salzberg 2020). Short reads were first trimmed and quality filtered using BBDuk (https://sourceforge.net/projects/bbmap/), and the filtered reads were used as input for POLCA, which is based on bwa-mem read mapping and subsequent variant calling with freebayes (Garrison & Marth 2012). To evaluate assembly quality and polishing effects, we applied the quality assessment tool QUAST (Mikheenko et al. 2018) and, for estimation of genome completeness, BUSCO (Simão et al. 2015). At the final stage of the assembly process, we performed contamination screening of the polished assembly with Kraken2 (Wood et al. 2019), using the standard database augmented with the genomes of Mesocricetus auratus and Mus musculus, and retained those contigs that were either classified as one of these two species or unclassified. Contigs classified as bacteria were removed, and those classified as human were further analyzed using BLASTN (Altschul et al. 1990) against the nt database to eliminate possible contamination with human genetic material.
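The BLASTN check of human-classified contigs follows a simple rule: per contig, bit scores of sufficiently significant hits are summed per taxid and the top-scoring taxon is assigned. The Python sketch below illustrates that rule only; it is not the authors' code. It assumes BLAST hits have already been parsed into (contig, taxid, e-value, bit score) tuples and that a set of Rodentia taxids is supplied by the caller; the function names are hypothetical. It also assumes that a contig is kept when its assigned taxon is a rodent or when no hit passes the threshold.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-25  # e-value threshold stated in the text


def assign_taxa(hits):
    """Return the taxid with the highest summed bit score for each contig.

    `hits` is an iterable of (contig, taxid, evalue, bitscore) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for contig, taxid, evalue, bitscore in hits:
        if evalue < E_VALUE_CUTOFF:
            scores[contig][taxid] += bitscore
    # Contigs without any hit below the cutoff do not appear in the result.
    return {c: max(per_taxid, key=per_taxid.get) for c, per_taxid in scores.items()}


def contigs_to_keep(contigs, hits, rodent_taxids):
    """Keep contigs assigned to a rodent taxid or lacking any qualifying hit."""
    best = assign_taxa(hits)
    return [c for c in contigs if c not in best or best[c] in rodent_taxids]


if __name__ == "__main__":
    # Toy data; 10036 = Mesocricetus auratus, 10090 = Mus musculus, 9606 = Homo sapiens
    example_hits = [
        ("ctg1", 10036, 1e-50, 900.0),
        ("ctg1", 9606, 1e-40, 500.0),
        ("ctg1", 10036, 1e-30, 300.0),
        ("ctg2", 9606, 1e-80, 2000.0),
        ("ctg3", 9606, 1e-5, 40.0),  # fails the e-value cutoff
    ]
    print(contigs_to_keep(["ctg1", "ctg2", "ctg3"], example_hits, {10036, 10090}))
```

In the toy data, ctg1 is kept because its rodent hits outscore the human hit, ctg2 is dropped as human, and ctg3 is kept because none of its hits passes the threshold.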
For every contig, we summed up the bitscores per taxid for all hits with e-values below 1e-25 and assigned the species with the highest summed score. All contigs with hits to the order Rodentia or without any hits passing the threshold remained in the final assembly. The versions and options for all tools in the bioinformatics toolchain are given in Supplementary Table S4.

Genome annotation

RNA-Seq reads were quality trimmed and adapter sequences were removed with Cutadapt (Martin 2011), and the filtered reads were mapped to the final polished assembly using the mapper STAR (Dobin et al. 2013). The mapped reads, together with closely related reference genomes and annotations of Mus musculus (GRCm38.102), Rattus norvegicus (Rnor_6.0.102) and Mesocricetus auratus (MesAur1.0.100), all obtained from Ensembl, were used as input for the hybrid genome annotation tool GeMoMa (Keilwagen et al. 2016, 2018) to predict gene loci. The mapped RNA-Seq reads were also used in a subsequent prediction of 5' and 3' UTRs. The resulting gff files were then converted to gtf format using GffRead (Pertea & Pertea 2020) and augmented with the original gene name(s) of the associated gene from the reference genomes with a custom Python script. Afterwards, the annotation was cleaned according to the following scheme: if transcripts annotated for a single locus matched different gene names, only transcripts associated with the same gene name as the highest scoring (GeMoMa score) transcript for this locus were retained. In a second step, if the same gene name was associated with multiple annotated loci, only the locus with the higher top score was retained. In another post-processing step, we removed exons shared by multiple genes, as these fusions were artifacts introduced by GeMoMa's UTR inference.
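The two-step cleaning scheme for gene names and loci can be illustrated with a short Python sketch. It is not the authors' custom script. The Transcript record and its field names are hypothetical, the score stands in for the GeMoMa score, and ties are resolved by keeping the first record seen, a detail the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    locus: str
    gene_name: str
    score: float  # stands in for the GeMoMa score


def clean_annotation(transcripts):
    """Apply the two cleaning steps and return the retained transcripts."""
    # Find the highest scoring transcript of every locus.
    top = {}
    for t in transcripts:
        if t.locus not in top or t.score > top[t.locus].score:
            top[t.locus] = t

    # Step 1: within a locus, keep only transcripts sharing the gene name
    # of that locus's top-scoring transcript.
    step1 = [t for t in transcripts if t.gene_name == top[t.locus].gene_name]

    # Step 2: if a gene name occurs at several loci, keep only the locus
    # whose top transcript has the highest score.
    best_locus = {}
    for locus, t in top.items():
        current = best_locus.get(t.gene_name)
        if current is None or t.score > top[current].score:
            best_locus[t.gene_name] = locus

    return [t for t in step1 if best_locus[t.gene_name] == t.locus]


if __name__ == "__main__":
    toy = [
        Transcript("t1", "L1", "A", 100.0),
        Transcript("t2", "L1", "B", 80.0),   # removed in step 1
        Transcript("t3", "L1", "A", 60.0),
        Transcript("t4", "L2", "A", 90.0),   # removed in step 2
        Transcript("t5", "L3", "C", 50.0),
    ]
    print([t.transcript_id for t in clean_annotation(toy)])
```

Tracking the top transcript per locus once lets both steps reuse the same lookup, so each step is a single pass over the data.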
Subsequently, for transcripts with identical exon boundaries, all but the one with the longest CDS were removed with the script agat_sp_fix_features_locations_duplicated.pl from the AGAT toolkit (Dainat et al. 2021). Finally, as we observed that annotated 3' UTRs were frequently too short, we extended them by a constant 1000 bp whenever their distance to the next annotated feature (same or opposite strand) was at least 3000 bp.

References

Basic local alignment search tool
The pathogenicity of SARS-CoV-2 in hACE2 transgenic mice
SARS-CoV-2 infection of Chinese hamsters (Cricetulus griseus) reproduces COVID-19 pneumonia in a well-established small animal model
Simulation of the Clinical and Pathological Manifestations of Coronavirus Disease 2019 (COVID-19) in a Golden Syrian Hamster Model: Implications for Disease Pathogenesis and Transmissibility
Ocular conjunctival inoculation of SARS-CoV-2 can cause mild COVID-19 in rhesus macaques
Can ACE2 Receptor Polymorphism Predict Species Susceptibility to SARS-CoV-2? Front. Public Health
A mouse-adapted model of SARS-CoV-2 to test COVID-19 countermeasures
STAR: ultrafast universal RNA-seq aligner
Animal models of COVID-19 hyper-inflammation
Haplotype-based variant detection from short-read sequencing
Adaptation of SARS-CoV-2 in BALB/c mice for testing vaccine efficacy
A SARS-CoV-2 Infection Model in Mice Demonstrates Protection by Neutralizing Antibodies
Syrian hamsters as a small animal model for SARS-CoV-2 infection and countermeasure development
Using intron position conservation for homology-based gene prediction
Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi
Infection and Rapid Transmission of SARS-CoV-2 in Ferrets
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
A Therapeutic Non-self-reactive SARS-CoV-2 Antibody Protects from Lung Pathology in a COVID-19 Hamster Model
Animal models for SARS-CoV-2
Minimap2: pairwise alignment for nucleotide sequences. Birol, I, editor
Comparison of nonhuman primates identified the suitable model for COVID-19
Cutadapt removes adapter sequences from high-throughput sequencing reads
Versatile genome assembly evaluation with QUAST-LG
Animal models for COVID-19
Age-Dependent Progression of SARS-CoV-2 Infection in Syrian Hamsters
ACE2-Variants Indicate Potential SARS-CoV-2-Susceptibility in Animals: An Extensive Molecular Dynamics Study
GFF Utilities: GffRead and GffCompare. F1000Research
SARS-CoV-2 is transmitted via contact and via the air between ferrets
Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS-coronavirus 2
Pathogenesis and transmission of SARS-CoV-2 in golden hamsters
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
Generation of a Broadly Useful Model for COVID-19 Pathogenesis, Vaccination, and Treatment
Development of safe and highly protective live-attenuated SARS-CoV-2 vaccine candidates by genome recoding
The Roborovski Dwarf Hamster Is A Highly Susceptible Model for a Rapid and Fatal Course of SARS-CoV-2 Infection
Fast and accurate de novo genome assembly from long uncorrected reads
Animal Models of COVID-19 II
Rodents II. Lynx Edicions: Conservation International
Improved metagenomic analysis with Kraken 2
In Silico Analysis of Intermediate Hosts and Susceptible Animals of SARS-CoV-2
Age-related rhesus macaque models of COVID-19
Roborovski hamster (Phodopus roborovskii) strain SH101 as a systemic infection model of SARS-CoV-2
The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. Ouzounis, CA, editor

Tool versions and options:
Cutadapt 2.10: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -a A{100} -g T{100} -q 20 -m
GeMoMa 1.7.1: GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=1 r=MAPPED ERE.s=FR_UNSTRANDED ERE.c=true AnnotationFinalizer.r=NO AnnotationFinalizer.u=YES

The authors thank Elisabeth Kirst, Jeannine Wilde and Madlen Sohn for sequencing support.

The genomic sequencing data underlying this article are available in the European Nucleotide Archive (ENA) and can be accessed with accession numbers ERR6740384, ERR6740385 (Illumina) and ERR6797440 (ONT). The accession numbers for the RNA-Seq raw reads are ERR6752847 (pr-d0-lung-1), ERR6752848 (pr-d2-lung-1), ERR6752849 (pr-d2-lung-2) and ERR6752850 (pr-d3-lung-2). The assembled genome together with its annotation has been uploaded to figshare (https://doi.org/10.6084/m9.figshare.16695457; ENA submission pending).