key: cord-0740262-k7ax0gph authors: Nomburg, Jason; Meyerson, Matthew; DeCaprio, James A. title: Pervasive generation of non-canonical subgenomic RNAs by SARS-CoV-2 date: 2020-08-04 journal: bioRxiv DOI: 10.1101/2020.04.28.066951 sha: 501533c916ec615a583d94806a2cefd9db017935 doc_id: 740262 cord_uid: k7ax0gph Background SARS-CoV-2, a positive-sense RNA virus in the family Coronaviridae, has caused a worldwide pandemic of coronavirus disease 2019 or COVID-19 Coronaviruses generate a tiered series of subgenomic RNAs (sgRNAs) through a process involving homology between transcriptional regulatory sequences (TRS) located after the leader sequence in the 5’ UTR (the TRS-L) and TRS’ located near the start of structural and accessory proteins (TRS-B) near the 3’ end of the genome. In addition to the canonical sgRNAs generated by SARS-CoV-2, non-canonical sgRNAs (nc-sgRNAs) have been reported. However, the consistency of these nc-sgRNAs across viral isolates and infection conditions is unknown. The comprehensive definition of SARS-CoV-2 RNA products is a key step in understanding SARS-CoV-2 pathogenesis. Methods Here, we report an integrative analysis of eight independent SARS-CoV-2 transcriptomes generated using three sequencing strategies, five host systems, and seven viral isolates. Read-mapping to the SARS-CoV-2 genome was used to determine the 5’ and 3’ coordinates of all identified junctions in viral RNAs identified in these samples. Results Using junctional abundances, we show nc-sgRNAs make up as much as 33% of total sgRNAs in vitro, are largely consistent in abundance across independent transcriptomes, and increase in abundance over time during in vitro infection. By assessing the homology between sequences flanking the 5’ and 3’ junction points, we show that nc-sgRNAs are not associated with TRS-like homology. By incorporating read coverage information, we find strong evidence for subgenomic RNAs that contain only 5’ regions of ORF1a. Finally, we show that non-canonical junctions change the landscape of viral open reading frames. Conclusions We identify canonical and non-canonical junctions in SARS-CoV-2 sgRNAs and show that these RNA products are consistently generated across many independent viral isolates and sequencing approaches. These analyses highlight the diverse transcriptional activity of SARS-CoV-2 and offer important insights into SARS-CoV-2 biology. 8 and using other sequencing technologies. To this end, we assessed five additional 150 transcriptomes generated from Vero cells (12), A549 cells (13), Calu3 cells (14), 151 bronchial organoids (15), and ferret nasal washings (13) We sought to assess if non-canonical junctions are driven by dataset-specific artifacts. 161 For example, it is possible that defective particles containing defective viral genomes 162 (DVGs), often with large internal deletions that increase their replicative fitness, could 163 accumulate during preparation of virion stocks (16) and generate the observed non-164 canonical junctions during subsequent infections. If this was true, we would expect to 165 observe non-canonical junctions specific to a single virion preparation and accumulation 166 of these specific non-canonical junctions over the course of infection. To address this 167 possibility, we identified dataset-specific junctions by identifying junctions that are 168 differentially abundant across datasets ( Figure 2B ). This analysis revealed that the vast 169 majority of 5' and 3' non-canonical junctions are low abundance and are similarly 170 abundant across five independent SARS-CoV-2 sequencing studies (7-9, 12, 13) 9 ( Figure 2C, D) . This suggests that most non-canonical junctions are not driven by 172 dataset-specific artifacts. While there is not a single strong motif driving non-canonical, intra-ORF junctions, it is 215 still possible that non-canonical junctions are associated with junction-specific homology 216 between the 5' and 3' junction sites. To address this possibility, we tallied the length of 217 homology between the 5' and 3' sites of canonical and non-canonical junctions. This 218 analysis revealed that canonical junctions have markedly longer homology lengths 219 (mean >10 bases) than non-canonical junctions (mean < 5 bases) (Figure 4E-G) . 220 Indeed, the homology lengths of non-canonical junctions are only slightly longer than 221 homology lengths between 100,000 random pairs of SARS-CoV-2-derived 30 base 222 sequences ( Figure 4E-G) . These data indicate that non-canonical junctions are not 223 associated with TRS-like homology. While there is clearly higher coverage of the 5' region of ORF1a in the three dRNAseq 235 transcriptomes, it remained unclear if these 5' ORF1a sgRNAs are a major percentage 236 of total ORF1a in the cell, or if this finding was due to a systematic bias favoring shorter 237 reads in nanopore sequencing. To address this question, we assessed ORF1a 238 coverage in four independent Illumina polyA-enriched RNAseq transcriptomes. Analysis 239 of these transcriptomes revealed a similar elevated 5' ORF1a coverage and consistent 240 drops in coverage around genome positions 1600 and 6000 ( Figure 5D-G) . Importantly, 241 one of these transcriptomes is from bronchial organoids ( Figure 5G ) and one 242 transcriptome is from the nasal washings of an infected ferret (Figure 5F ), suggesting 243 that 5' ORF1a subgenomic RNAs can be produced in vivo and were not only due to 244 artifacts of highly replicative conditions in vitro. Furthermore, analysis of a transcriptome 245 generated from RNA that was not polyA-enriched revealed a consistent elevation in 5' 246 ORF1a coverage (Figure 5H ), ruling out a systematic bias from polyA isolation. We 247 found that there is similar elevated 5' ORF1a coverage in the circulating human 248 coronavirus CoV 229E (17) (Supplementary Figure 4) , supporting the existence of 5' 249 ORF1a RNAs in other coronaviruses. All together, these data support the hypothesis 250 that subgenomic RNAs containing just the 5' region of ORF1a are produced by SARS-251 CoV-2. To test the hypothesis that non-canonical junctions can result in variant open reading 255 frames (ORFs) in nc-sgRNAs, we assessed the ORF content of each read in the three 256 dRNAseq transcriptomes. Because dRNAseq can have an error rate of greater than 257 10% (18), we generated "coordinate-derived transcripts" by determining the start, end, 258 and junction coordinates of each raw transcript by mapping to SARS-CoV-2 259 (NC_045512.2) and extracting the associated genomic sequence. While this approach 260 is capable of overcoming sequencing errors and can resolve major RNA species, it is 261 not capable of resolving small or structurally complex RNA rearrangements. Because 262 the 5' end of nanopore reads may be truncated due to technical factors during 263 S ORF variants make up between 5.1% and 24.2% of all S ORFs identified in these 286 datasets ( Figure 6A-C) . The majority of S ORF variants are predicted from nc-sgRNAs 287 with 3' junctions landing within the S ORF (Supplementary Figure 2D- While prior studies have identified nc-sgRNAs in SARS-CoV-2 (8, 11) and other 304 coronaviruses (17), there has been no comprehensive analysis of SARS-CoV-2 nc-305 sgRNAs. We conducted an integrative analysis of eight independent SARS-CoV-2 306 transcriptomes to catalogue unexpected, non-canonical RNA species. We show that 307 SARS-CoV-2 produces nc-sgRNAs, that these nc-sgRNAs increase in abundance over 308 time, and that they are not associated with TRS-like homology. We highlight strong 309 evidence supporting the existence of nc-sgRNAs containing only the 5' region of ORF1a 310 and show that the presence of nc-sgRNAs can alter the landscape of viral ORFs. 311 We found that canonical sgRNAs are consistently produced across SARS-CoV-2 313 transcriptomes, and identify eight major sgRNAs encoding S, ORF3a, E, M, ORF6, 314 ORF7a/b, ORF8, and N. Each of these sgRNAs is supported by abundant 3' junctions 315 originating at a TRS-B just upstream the associated ORF. Similar to other reports, we 316 do not find evidence of a major series of sgRNAs encoding ORF10 (7, 8, 11) . We also 317 do not find strong evidence of a major group of sgRNAs encoding ORF7B despite the 318 presence of a TRS-core-like sequence (AAGAAC) proximal to the ORF7B start. The consistent presence of non-canonical junctions across the SARS-CoV-2 genome in 341 independent transcriptomes suggests that the bulk of the observed nc-sgRNAs are not 342 dataset-specific but are generalizable to SARS-CoV-2 biology. Furthermore, their 343 nonspecific increase in abundance over time in vitro supports a model in which nc-344 sgRNAs are generated nonspecifically and at higher frequencies later in infection. We 345 found that non-canonical junctions are not associated with TRS-like homology, raising 346 the possibility that there is another mechanism behind their formation. For example, it is 347 possible that factors such as RNA structure (23), accumulation of viral proteins, or 348 actions of the viral RdRp and associated cofactors (24) could influence nc-sgRNA 349 formation. Indeed, a recent study of the murine betacoronavirus MHV suggests that the 350 3'-to'5' exoribonuclease Nsp14 influences recombination site selection of the viral RdRp 351 (24). Additional studies are necessary to gain a mechanistic understanding of the 352 factors that influence nc-sgRNA generation and to understand the phenotypic effects of 353 nc-sgRNAs on SARS-CoV-2 pathogenesis. Notably, considering that defective viral 354 genomes in negative-sense RNA viruses have been associated with antiviral immunity, 355 dendritic cell maturation, and interferon production (5, 16, 25, 26) it will be important to 356 assess if SARS-CoV-2 nc-sgRNAs are associated with immune activity. In conclusion, we show that SARS-CoV-2 generates nc-sgRNAs in vitro and in vivo, and 388 that these nc-sgRNAs are largely consistent across independent transcriptomes. We 389 show that nc-sgRNAs are not associated with TRS-like homology, suggesting there may 390 be a homology-independent mechanism driving their formation. Finally, we show that 391 non-canonical junctions have the potential to influence the landscape of viral ORFs, and 392 that there is especially strong support for nc-sgRNAs that contain only a 5' portion of 393 ORF1a. Future studies are necessary to understand the mechanisms behind nc-sgRNA 394 formation, and to understand the influence of nc-sgRNAs on SARS- Reads in all transcriptomes were mapped against the SARS-CoV-2 reference genome 409 (accession NC_045512.2) using minimap2 (30) and custom parameters detailed in 410 minimap_sars2.sh and inspired by Kim et al. (8) , and unmapped reads were discarded. were created by using these coordinates to retrieve the associated sequence from the 502 SARS-CoV-2 genome. Any minimap2-called deletion or intron of at least 1000 bases in 503 length was considered a junction. All coordinate-derived transcripts were then aligned 504 against a minimal leader sequence fragment (5' -505 CTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTC -3') 506 using blastn (36). This minimal sequence was used instead of the full leader sequence 507 because inspection of 5' coverage in the three long-read datasets revealed a significant 508 drop in coverage at the end of this sequence. Transcripts with an alignment of at least 509 45 nucleotides in length and a percent identity of at least 85%, and that contain the start 510 of this sequence within the first 50 nucleotides of the transcript, were kept. Transcripts 511 not containing this leader sequence were discarded. 512 513 ORFs were called directly from each transcript using Prodigal (37), using parameters 514 listed in prodigal.sh. ORFs were allowed to initiate from any nUG start codon, were not 515 required to be the most 5' ORF on each transcript and were only reported if they are at 516 least 30 codons in length. Each ORF was translated into amino acid sequence using 517 prodigal_to_orf_direct.py. If multiple Prodigal-predicted ORFs contained an overlapping 518 amino acid sequence, the longest ORF was output. Each ORF was output in fasta 519 format and labeled with transcript of origin and coordinates of the ORF on the transcript. 520 The amino acid sequence of each ORF was then mapped against canonical SARS-522 CoV-2 proteins (from accession NC_045512.2) as well as Prodigal-predicted 523 unannotated ORFs from the SARS-CoV-2 genome using DIAMOND (38). DIAMOND position was calculated and plotted as an inverse peak. The histogram bin size is 590 100 bases, meaning each inverse peak represents the cumulative count of 5' or 3' 591 junctions occurring within that span. Curved lines represent the 5' and 3' locations of 592 junctions that occur at least twice. Red curves represent canonical junctions, black 593 curves represent non-canonical junctions. "Taiaroa" (7), "Kim" (8), and "Davidson" 594 (9) represent the three independent SARS-CoV-2 dRNAseq datasets investigated. A-C) For each location on the viral genome, a histogram of 5' and 3' junctions at that position was calculated and plotted as an inverse peak. The histogram bin size is 100 bases, meaning each inverse peak represents the cumulative count of 5' or 3' junctions occurring within that span. Curved lines represent the 5' and 3' locations of junctions that occur at least twice. Red curves represent canonical junctions, black curves represent non-canonical junctions. "Taiaroa", "Kim", and "Davidson" represent the three independent SARS-CoV-2 dRNAseq datasets investigated. B) Description of computational approach to determine the consistency of 5' and 3' junction points across independent datasets. The percentage of non-canonical junctions at each genome position in each dataset was determined (X). The mean (μ) and standard deviation (σ) of percentages at each position across the five datasets was calculated. For each position in each dataset, the Z-score was calculated as (X-μ)/σ (i.e. the number of standard deviations away from the mean), and the percentages and Z-scores for each position in each dataset were plotted. C, D) Each point represents the percentage and Z-score of one position in one dataset. For each position in the SARS-CoV-2 genome, the percentage and Z-score of non-canonical junctions with a 5' end (D) or a 3' end (E) was determined as described above for five independent datasets: Taiaroa, Kim, Davidson, Finkel, and Blanco-Melo. Points with a percentage above 4% of non-canonical junctions and a Z-score above 1 were highlighted. A-C) ORFs were predicted directly from transcript-derived reads for the three dRNAseq datasets, and each ORF was aligned against the protein sequences of canonical SARS-CoV-2 genes using the DIAMOND aligner. Variant ORFs were defined as ORFs that were assigned to a canonical SARS-CoV-2 protein but had an unexpected start or stop position, while perfectly-aligning ORFs were considered canonical. A-E) For each location on the viral genome, a histogram of 5' and 3' junctions at that position was calculated and plotted as an inverse peak. The histogram bin size is 100 bases, meaning each inverse peak represents the cumulative count of 5' or 3' junctions occurring within that span. Curved lines represent the 5' and 3' locations of junctions that occur at least twice. Red curves represent canonical junctions, black curves represent non-canonical junctions. F) The percentage of junctions that are canonical in the Suzuki (Bronchial organoids) and Blanco-Melo (Ferret) datasets was determined. Junctions were assigned as canonical if their 5' location was within 20 bases of the TRS-L and their 3' location within 15 bases of a TRS-B, and otherwise assigned as non-canonical. Plant EP, Dinman JD. The role of programmed-1 ribosomal frameshifting in 735 coronavirus propagation SARS coronavirus accessory proteins. Virus 738 research The group-specific 740 murine coronavirus genes are not essential, but their deletion, by reverse genetics, is 741 attenuating in the natural host Severe 743 acute respiratory syndrome coronavirus group-specific open reading frames encode 744 nonessential functions for replication in cell cultures and mice A contemporary view of coronavirus 747 transcription Direct 749 RNA sequencing and early evolution of SARS-CoV-2 The architecture of 752 SARS-CoV-2 transcriptome Characterisation of the transcriptome and proteome of SARS-CoV-2 reveals a cell 755 passage induced in-frame deletion of the furin-like cleavage site from the spike 756 glycoprotein The architecture of 758 SARS-CoV-2 transcriptome Characterisation of the transcriptome and proteome of SARS-CoV-2 using direct 761 RNA sequencing and tandem mass spectrometry reveals evidence for a cell passage 762 induced in-frame deletion in the spike glycoprotein that removes the furin-like cleavage 763 site The coding capacity of SARS-CoV-2 Imbalanced host response to SARS-CoV-2 drives development of COVID-19 Bulk and 770 single-cell gene expression profiling of SARS-CoV-2 infected human cell lines identifies 771 molecular targets for therapeutic intervention Generation of 773 human bronchial organoids for SARS-CoV-2 research The impact of defective viruses on infection and immunity. 776 Annual review of virology Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel 779 insights into structural variants and enables modification analysis. Genome research Severe acute respiratory syndrome coronavirus (SARS-CoV) infection inhibition using 789 spike protein heptad repeat-derived peptides Population dynamics of an RNA virus and its defective 792 interfering particles in passage cultures Biological activities of'noninfectious' influenza A virus particles. 794 Future virology Structure of the full SARS-CoV-2 RNA genome in infected cells The coronavirus proofreading exoribonuclease mediates extensive viral 800 recombination Immunostimulatory defective viral genomes from respiratory syncytial virus promote a 803 strong innate antiviral response during infection in mice and humans Sendai virus defective-interfering genomes 806 and the activation of interferon-beta Gene expression regulation by upstream open 808 reading frames and human disease Cryo-810 EM structure of the 2019-nCoV spike in the prefusion conformation Identification of novel 813 subgenomic RNAs and noncanonical transcription initiation signals of severe acute 814 respiratory syndrome coronavirus Minimap2: pairwise alignment for nucleotide sequences Team RC. R: A language and environment for statistical computing gggenes: draw gene arrow maps in 'ggplot2 The sequence 825 alignment/map format and SAMtools 827 BLAST+: architecture and applications Prodigal: 829 prokaryotic gene recognition and translation initiation site identification. BMC 830 bioinformatics Fast and sensitive protein alignment using 832 DIAMOND