key: cord-1049250-jhg0jsqh authors: Li-Pook-Than, Jennifer; Banuelos, Selene; Honkala, Alexander; Sahoo, Malaya K.; Pinsky, Benjamin A.; Snyder, Michael P. title: Long-read sequencing of SARS-CoV-2 reveals novel transcripts and a diverse complex transcriptome landscape date: 2021-03-06 journal: bioRxiv DOI: 10.1101/2021.03.05.434150 sha: 0cc0b93e9c117e21d201c4bc77e9bc174e55370d doc_id: 1049250 cord_uid: jhg0jsqh Severe Acute Respiratory Syndrome Coronavirus 2, SARS-CoV-2 (COVID-19), is a positive single-stranded RNA virus with a 30 kb genome that is responsible for the current pandemic. To date, the genomes of global COVID-19 variants have been primarily characterized via short-read sequencing methods. Here, we devised a long-read RNA (IsoSeq) sequencing approach to characterize the COVID-19 transcript landscape and expression of its ∼27 coding regions. Our analysis identified novel COVID-19 transcripts including a) a short ∼65-70 nt 5’-UTR fused to various downstream ORFs encoding accessory proteins such as the envelope, ORF 8, and ORF 9 (nucleocapsid) proteins, that are relatively highly expressed, b) novel SNVs that are differentially expressed, whereby a subset are suggestive of partial RNA editing events, and c) SNVs at functional sites, whereby at least one is associated with a differentially expressed spike protein isoform. These previously uncharacterized COVID-19 isoforms, expressed genes, and gene variants were corroborated using ddPCR. Understanding this transcriptional complexity may help provide insight into the biology and pathogenicity of SARS-CoV-2 compared to other coronaviruses. The Severe Acute Respiratory Syndrome Coronavirus 2, SARS-CoV-2, is a highly infectious betacoronavirus and the cause of the coronavirus disease 2019 (COVID-19) responsible for the current 2020 global pandemic (as of December 2020; over 88 M cases worldwide based on Worldometer.info). 
The disease predominantly affects the lungs and causes varying degrees of severity and symptoms that range from asymptomatic to mild cases of fatigue, fever/chills, cough, ache, and loss of smell, although up to 20% of cases will be severe with pneumonia, dyspnea, and acute respiratory distress syndrome 1,2 . It has caused more infections to date than other severe coronaviruses, such as SARS-CoV-1 and Middle East Respiratory Syndrome Coronavirus (MERS-CoV), combined 3 . The pathway of infection for both SARS-CoV-1 and 2 is via a receptor binding domain (RBD) on the C-terminal of the spike protein S1 subunit protruding from the viral capsid that binds to human angiotensin-converting enzyme 2 (hACE2), which is present at high levels in pulmonary epithelia. Antibodies in convalescent patients are primarily targeted to this RBD region and accordingly vaccine development has been focused on this domain 4 . Notably, there are four distinct amino acid variants (G, V, E, G) present in the RBD domain of SARS-CoV-2 that increase the affinity of SARS-CoV-2 binding to hACE2 relative to other coronaviruses 5, 6 . Developing a better understanding of the sequence variation and a comprehensive transcript map of SARS-CoV-2 across global populations is important to better understand viral biology, evolution, and the development of improved therapies against COVID-19 infection. In early January 2020, the first near complete SARS-CoV-2 genome was released to the GISAID database and is widely used as the primary COVID-19 reference genome (NC_045512). As of December 2020, over 47,600 SARS-CoV-2 sequences have now been recorded in the NCBI database. SARS-CoV-2 is a large RNA virus with a ~29.8kb genome sharing up to 79% sequence identity with SARS-CoV-1; other family members include HCoV-229E, HCoV-OC43, HCoV-NL63, and HCoV-HKU1, which cause common colds 3, 7 . SARS-CoV-2 contains 27 proteins generated by 14 open reading frames (ORFs) (Fig. 1a) . 
The long ORF1a and ORF1b regions are translated as one large peptide and then post-translationally cleaved into 15 nonstructural proteins (nsp), including nsp 1-10, 12a/b, 13-16, an RNA dependent RNA polymerase (RdRP), helicase (H), exonuclease (Ex) and endonuclease (En) 8 . The nsp proteins include two viral proteases, nsp 3 (papain-like protease) and nsp 5 (3C-like protease), involved in the post-translational processing of ORF1a/b. Further downstream are the structural proteins, including the two subunits of the spike surface protein, S1 and S2, which contain the least-conserved regions among coronaviruses, as well as the regions encoding the receptor binding motif (RBM) within the RBD and a polybasic cleavage site between S1 and S2 9 . Following the S2 region are seven accessory proteins interspersed between the envelope protein (E), membrane glycoprotein (M) and nucleocapsid phosphoprotein (N) regions. The most common methods for the molecular detection of COVID-19 involve extraction of nucleic acids from nasopharyngeal swabs, conversion of RNA to cDNA, and amplification using qPCR probes designed to conserved sites within the SARS-CoV-2 genome, including regions in the E-protein and RdRP genes. The mechanism of gene expression in positive strand RNA viruses like SARS-CoV-2 is complex with three hypothesized mechanisms for the production of mature RNAs, subgenomic (SG) RNAs, from a large negative strand (-ve) RNA template. These proposed mechanisms include using 1) the -ve strand internal promoter for transcription initiation; 2) prematurely terminated -ve strands acting as heterogeneous templates for production of each SG RNA species; or 3) discontinuous RNA synthesis of the -ve strand template 10 . The vast majority of COVID-19 genomes and transcripts sequenced to date have used short-read methods which, although powerful for rapid sequencing results and sequence comparison, nonetheless yield limited data on the different SARS-CoV-2 transcripts. 
In particular, we do not know the catalog of the transcripts produced by the virus and the levels of these transcripts, nor do we know which single nucleotide variants are associated with the different isoforms and mechanisms by which the variants might have formed. Such information will be valuable for understanding the basic biology of this virus and its pathogenicity 11 , particularly in the context of emerging new strains, such as the recently identified B.1.1.7 variant that may have increased transmission characteristics 12, 13 . To better understand the SARS-CoV-2 RNA landscape, we adapted a long-read PacBio sequencing technology to characterize SARS-CoV-2 transcriptomes from multiple COVID-19 patients. We developed an analysis pipeline to describe a suite of COVID-19 transcripts, several of which appear to be full-length, and describe their levels, a subset validated by ddPCR. We discovered novel 5'- and 3'-end UTRs and unusual associated transcriptional rearrangements. These UTRs often included novel repetitive sequences. We further link a natural single nucleotide variant (SNV) found in specific spike protein isoforms to a decrease in expression relative to the "wildtype" nucleotide. Taken together, these data provide new insight into the overall transcript-processing landscape of SARS-CoV-2, including partially RNA edited transcripts, and provide an improved framework for understanding COVID-19 variants, gene isoforms, and their expression mechanisms. Characterization and quality of SARS-CoV-2 RNA. We adapted a method for long read sequencing of SARS-CoV-2 transcripts using Pacific Biosciences SMRT Sequencing technology (Fig. 1) . Patients from the San Francisco Bay Area were tested for COVID-19 using nasopharyngeal swabs and an FDA-Emergency Use Authorized real-time RT-PCR method, with cycle threshold (Ct) values recorded as a proxy for abundance ( Supplementary Fig. 1 ). 
We independently confirmed the presence and level of SARS-CoV-2 sequences using ddPCR (Fig. 1c) . This is using the parallel E-protein primer/probes of the Gold-Standard SARS-CoV-2 qPCR detection method, but with the additional dual-channel fluorometric capacity of the ddPCR platform (Materials and Methods). From a set of 24 COVID-19 positive samples, six of high quality were selected for sequencing and analysis (Fig. 1b) . Total RNA isolated from these high-quality nasopharyngeal samples showed an average transcript size of 1500-3000 nt using the Bioanalyzer (Fig. 1d ) as compared to other samples that have many lower sized RNAs, suggestive of some degradation ( Supplementary Fig. 1a) . The nasopharyngeal RNAs of these six samples primarily produced a single band peak at 2 kb ranging from ~1.5 to 4 kb while unexpectedly rRNAs from the human host were not detected in the Bioanalyzer data. Two SARS-CoV-2 RNA controls were also analyzed in parallel with the patient samples, 1) a Twist Bioscience SARS-CoV-2 (MT199235 -USA/CA9/2020) synthetic control containing six non-overlapping 5 kb fragments and 2) SARS-CoV-2 RNAs isolated from Vero E6 mammalian cells transfected with a 30 kb cloned isolate (MN985325.1 -2019-nCoV/USA-WA1/2020) supplied by ATCC (ATCC® VR-1986D™). These controls also produced a single band using the bioanalyzer ( Supplementary Fig. 2 ). Overall, we conclude that SARS-CoV-2 RNAs of 1.5 to 4 kb in size dominate the pool of nasopharyngeal RNAs in infected patients. Because patient SARS-CoV-2 RNAs are low in abundance, to optimize concentrations and potentially identify as many SARS-CoV-2 transcripts as possible, two sets of total RNA were pooled from three different high-quality COVID-19+ samples: Pool A (Patients 25, 125, 290) and Pool B (Patients 90, 260, 300). Universal Human Reference RNA (UHRR, Agilent) was included to balance the RNA concentrations in the pools. 
cDNA was generated from total RNA using a 9:1 ratio of random primers:anchored oligo-dT primer and used to generate cDNA libraries ranging from 1500-3000 bp. SARS-CoV-2 viral transcripts were captured using custom designed 120 bp IDT probes. Sequencing was done on a single SMRT Cell on the PacBio Sequel system. This resulted in 59,674 circular consensus sequence (CCS) reads from Pool A and 42,629 from Pool B. Since the UHRR was spiked into the sample to increase RNA concentrations, the majority of sequenced reads matched the UHRR controls (and not human host sequence as observed by lack of rRNA), corroborating the observation that prior to spike-in, most real-world sample reads are viral ( Table 1 ). The average length of the reads was ~1915 nt for Pool A and ~1871 nt for Pool B (Table 1) , and reads ranged in size from 300 nt to 5 kb (Supplementary Fig. 2a-b) . To capture the variety of SARS-CoV-2 RNA transcripts and their expression levels from the long reads, we used a combination of StringTie v2.1.3 as well as our own custom pipeline to assemble reads into transcript classes (LORE pipeline, Supplementary Fig. 3 ). The two pooled sets of data from the six COVID-19 positive samples revealed at least fifteen distinct isoforms of SARS-CoV-2 transcripts based on the presence of presumed fusion events, and many of these contained polyA tails (although this count does not include the variable 5' ends, which are heterogeneous). Interestingly, only two of these isoforms were shared among the pooled sets, Ai2/Bi2 and Ai7/Bi3 ( Fig. 2 ). In addition, three sets of sequences detected at relatively low abundance, Ai7, Bi1 and Bi3, lacked spliced regions or polyA tails and may thus represent either subgenomic fragments or parts of transcripts. Finally, several fusion transcripts were found at very low abundance (<5) and are described further below. In aggregate, the transcripts fully covered the reference genome isolate Wuhan-Hu-1 (MT008022, NC_045512.2). 
The majority of transcripts identified ranged in size from 300 nt to ~5kb and their relative abundance based on read frequency (which may be biased) is shown in Fig. 2 . We believe that many of these are full length transcripts as they contain 5' genomic regions as well as poly A tails. Indeed, on average, 13.5% and 6.0% of transcripts in Pools A and Pool B, respectively, contained poly A tails ranging from 15-40 A's in length, which were found at the end of the majority of identified isoforms. The majority of polyA transcripts were found on the transcript of the accessory proteins isoforms, particularly the penultimate nucleocapsid (N) ORF (Fig. 2 green) . The high prevalence of this polyA transcript may be a result of 3' bias of the RT-PCR method used to generate the sequencing libraries. All 27 ORFs were expressed, albeit at varying levels, in the nasopharyngeal samples (Fig. 2) . The majority of nsp (blue, Fig. 2 ) were found to be expressed on the same transcript, although at low expression levels relative to the accessory proteins (in Pool A), whereas the spike protein had expression levels below the limit of detection via StringTie in Pool A. However, the spike protein was detectable as part of a long isoform also containing nsp and partial accessory protein (green, Fig. 2 ) in Pool B. Each SARS-CoV-2 gene was also observed for the ATCC controls, albeit at more homogenous expression levels, as expected for cloned products. The two SARS-CoV-2 RNA controls (Twist and ATCC controls) were also sequenced in parallel with patient samples. As expected, when mapped to genomic regions, the reads from the synthetic interspaced 5 kb fragments of the Twist control produce five clusters of reads whereas the ATCC control, generated in vitro, has a more distributed expression pattern similar to the COVID-19 patient samples (Fig. 3 ). 
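The poly A criterion mentioned above (tails of 15-40 A's at the 3' end of a transcript) can be illustrated with a minimal sketch. This is not the authors' pipeline code; the function names and the 15-nt minimum used as a cutoff are assumptions made for illustration, and reads are assumed to be plain strings already oriented 5' to 3'.

```python
from typing import Iterable


def poly_a_tail_length(read: str) -> int:
    """Return the number of consecutive A's at the 3' end of a read."""
    seq = read.strip().upper()
    # Length removed by stripping trailing A's equals the tail length.
    return len(seq) - len(seq.rstrip("A"))


def fraction_with_poly_a(reads: Iterable[str], min_len: int = 15) -> float:
    """Fraction of reads whose 3' A-run is at least `min_len` long.

    `min_len=15` mirrors the lower bound of the 15-40 A range quoted in
    the text; it is an illustrative assumption, not a documented parameter.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    tailed = sum(1 for r in reads if poly_a_tail_length(r) >= min_len)
    return tailed / len(reads)
```

Applied to the reads of one pool, `fraction_with_poly_a` would give a percentage comparable in kind to the 13.5% and 6.0% figures reported above, although the exact numbers depend on how adapters and primers were trimmed beforehand.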
In the real-world patient samples, up to fourteen distinct isoforms were identified in the population of RNAs, with the most abundant expression (highest FPKM) including four variations of genes at the 3' end ORFs (Fig. 2, left) . These ORFs include part of the envelope protein (E), the membrane glycoprotein (M), ORFs 6-8, ORF 9/nucleocapsid phosphoprotein (N) and ORF 10. Corroborating the fusion hypothesis in coronavirus transcription, a shorter 1-65 nt segment of 5' UTR, also described by Kim et al., was found to be joined or 'alternatively spliced' to four varieties of isoforms, including M, N, and ORFs 6-8 (downstream accessory proteins), instead of the annotated full-length 1-260 nt 5' UTR 10,15 ( Fig. 2, green section) . The ATCC control (derived from clones) also showed a variety of full-length isoforms, some altogether lacking the expected full-length 5' UTR. Segregated isoforms of 5' and 3' UTR transcripts were also identified in the population, as was a subset of isoforms in which the N protein is fused to poly A, missing ORF 10 and the 3' UTR (NA1 and NB1). Although the RNAs were heated to denature RNA structures (Methods), both 5' and 3' UTRs of coronaviruses are known to have significant secondary hairpin structures 16 and the reverse-transcriptase methods used to generate the sequencing libraries may be unable to capture the entire 5' region. Distinct from traditional long-sequencing pipelines 17, 18 , the Long-Reads Analysis (LORE) was developed to additionally identify strings of consecutive genes and non-consecutive regions, as well as to characterize transcripts carrying single nucleotide variants (SNVs) alongside their relative expression levels. A number of expressed novel SNVs were also discovered in the long read regions as shown in Fig. 3 ; some of these likely have functional consequences. 
In order to identify SNVs of high confidence, variant correction based on sequencing depth was applied to correct for platform errors, which include RT-PCR amplified misincorporated single nucleotides, as well as the relatively high error rate of indels in raw PacBio long-sequence reads (data not shown). Similarly, the ATCC controls are from cultured Vero cells, potentially introducing variant amplification errors distinct from the original strain. Nine and seven high confidence SNVs were identified in samples from Pools A and B, respectively, with a minimum coverage depth of 25 reads (located >10 nt from the read ends). These SNVs are shown relative to the Twist and ATCC controls in Fig. 3a . Four SNVs overlapped between Pool A and the ATCC control (reference is a Washington source patient). Over 80% of the novel SNVs are C-to-U (9) or A-to-G (4), suggesting an RNA editing (deaminase-like) modification (Supplementary Table 2 ). Positions 17747, 17858 and 18060 show expression of their alternate variants at 3%, 2% and 4%, respectively, and suggest these are candidate transcripts for partial editing (Figure 4a isoform of nsp 14-16 (Ex, En, Me) fused to poly A, contain the wildtype "U" at position 26360, while a subset of tiled reads (concatenated SNVs found in nsp 12b, 13-15), categorized here as subgenomic ( Fig. 2 ; SG-Pool A asterisks), contain the "C" SNV at the respective position. Interestingly, two SNVs were found near (1-2 nt) protease cleavage site junctions that may have important functional consequences. A C-to-U change was found 1 nt from the nsp2/3 cleavage site at position 2721, and a G-to-U change was identified 1 nt from the spike subunit 1 and subunit 2 cleavage site at position 23403 ( Fig. 3a and 4a , green arrows). Interestingly, this latter site is associated with a significant decrease in downstream expression levels (green downward triangles, Fig. 3b ) in Pool B compared to Pool A (wildtype). 
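The high-confidence criteria stated above (a minimum coverage depth of 25 reads and a position more than 10 nt from the read ends), together with the grouping of substitutions into deaminase-like classes, can be sketched as follows. This is an illustration only: the `SnvCall` record and its field names are hypothetical, and the authors' depth-based variant correction itself is not described in enough detail to reproduce here.

```python
from dataclasses import dataclass

MIN_DEPTH = 25          # minimum coverage depth stated in the text
MIN_END_DISTANCE = 10   # SNV must lie more than 10 nt from the read ends


@dataclass
class SnvCall:
    """Hypothetical per-site summary of a candidate variant."""
    position: int               # 1-based reference coordinate
    ref: str                    # reference ("wildtype") nucleotide
    alt: str                    # alternate nucleotide
    depth: int                  # reads covering the position
    alt_reads: int              # reads supporting the alternate nucleotide
    min_dist_to_read_end: int   # smallest distance to any supporting read end


def is_high_confidence(call: SnvCall) -> bool:
    """Apply the depth and read-end distance thresholds."""
    return call.depth >= MIN_DEPTH and call.min_dist_to_read_end > MIN_END_DISTANCE


def substitution_class(ref: str, alt: str) -> str:
    """Label deaminase-like changes; DNA 'T' is treated as RNA 'U'."""
    pair = (ref.upper().replace("T", "U"), alt.upper().replace("T", "U"))
    if pair == ("C", "U"):
        return "C-to-U"
    if pair == ("A", "G"):
        return "A-to-G"
    return "other"


def alt_fraction(call: SnvCall) -> float:
    """Fraction of covering reads that carry the alternate nucleotide."""
    return call.alt_reads / call.depth if call.depth else 0.0
```

A site passing `is_high_confidence` whose `alt_fraction` is only a few percent would be a candidate for the partial editing discussed above, whereas a fraction near 1 suggests a fixed variant in the sampled population.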
The spike protein isoforms from Pool B showed lowered expression by at least five-fold and were associated with consecutive long-sequenced reads that contained a series of SNVs, including the position 23403 SNV, two A-to-G variants at position 21562 (1 nt before the AUG start site of S1 subunit) and position 22460 within the S1 subunit, as well as a G-to-U variant at position 25563 downstream of the S2 subunit in ORF 3 (Fig. 2 , SG-Pool B asterisks, Fig. 3b , green triangles, and Fig. 4a , blue asterisks). In total, sixteen high quality SNVs were identified; in seven cases more than one SNV was found to be expressed at a particular site, suggesting that these may be more polymorphic regions (Fig. 4 ) and may reflect the different patients in each pool. We validated the presence of three SNVs and their expression levels using ddPCR in comparison to the ATCC and Twist synthetic RNA controls and the SARS-CoV-2 Wuhan reference (MT008022) genome ( Fig. 3-4a asterisks) . Position 8782 (C>U) was confirmed as a wildtype "C" in Pool B and, as expected, found to be the variant "U" in the ATCC control (Fig. 4b) , resulting in a synonymous mutation. In Pool A, position 23403 (A>G) was confirmed in the spike S1 subunit of the ATCC control and matched the reference genome "A" nucleotide ( Fig. 4c ) whereas Pool B had the "G" variant (Fig. 4c, bottom panel) . The latter of these SNVs results in an amino acid change, D-to-G. ddPCR also revealed (and corroborated) the absolute levels of these isoforms in the pools. At position 14407 (nsp 12b), the "C" nucleotide was confirmed in the Twist RNA control, while in Patient 300 (part of Pool B), the "U" variant is observed. This latter variant results in an amino acid change, P-to-L. Long-read sequencing identified a diverse transcriptional landscape in COVID-19 patients compared to what has been described previously. 
A wide range of transcripts, and isoform expression, was uncovered ranging 300-5000 nt in length, including detection of novel transcript isoforms and differentially expressed isoforms related to newly identified SNVs. The recombined architecture of SARS-CoV-2 was previously described by Kim et al., 2020 in Vero-infected cells using Nanopore and Nanoball sequencing, which allowed for the identification of discontinuous Overall, there have been reports of SARS-CoV-2 mutations thought to drive pathogenicity, several of which are found within the spike protein region with corresponding changes in the coding amino acid 22, 24 . It will be of interest to further investigate the 3D protein structure in context to these amino acid changes. Notably, the majority of SNVs were found to be A-to-G or C-to-U deaminase changes reminiscent of RNA editing events A-to-I ADAR and C-to-U APOBEC machinery. This pattern was observed when SARS-CoV-2 bronchoalveolar lavage fluid-based RNA sequence data was compared with those of other coronaviruses including MERS-CoV and SARS-CoV-1 25 . In their work, the authors proposed several mechanisms for SNV changes, including one where RNA editing occurs in coordination with negative strand transcription events 25 . Several variants we found showed partially expressed RNA editing, and several were located close to viral transcript rearrangement junctions associated to transcripts of variable gene expression. This points towards biological pathways operating in parallel to post-transcriptional modification events and secondary/tertiary RNA structures being critical for the recognition of specific consensus sites and possibly regulating the mechanisms by which viral transcript rearrangements and differential transcription events take place 26, 27 . 
Distinct RNA structures and relatively high frequency of RNA viral recombination events in the family of Coronaviruses suggest a mechanistic regulation of viral genome organization rather than a selection of an evolutionary advantageous genotype 28 . Notably, distinct secondary structures are observed in coronaviruses, particularly the 5' and 3' Table 4 ). StringTie v2.1.4 usage included long reads processing (-L) which also enforces -s 1.5 -g 0 (default:false) and reference annotation (-G) used for guiding the assembly process (GTF/GFF3). While StringTie and IGV analysis efficiently identified gene rearrangements in consecutive order, a subset of reads also contained novel sequences and gene rearrangements. were selected as high quality (as summarized in Supplementary Table 2 ). C1) Identify junction sites from the reference genome and categorize SNV locations within 10 nt from the junction. Discern novel-splice junctions 34 . C2) Calculate differential expression level downstream of isoforms containing SNVs relative to wildtype. RPKM = (CDS read count × 10^9) / (CDS length × total mapped read count). FPKM (fragments rather than reads) is used when the data are paired, in which case only one of the mates is counted. Additional sequences (nRep) that are associated with the long-read PacBio platform as adaptor sequence were identified using LORE (Part B). 
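The RPKM expression given in step C2 can be written out as a short worked sketch. The function and parameter names are assumptions chosen for illustration, and the fold-change helper is one plausible way to express "expression relative to wildtype", not necessarily the calculation used in LORE.

```python
def rpkm(cds_read_count: int, cds_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of CDS per million mapped reads.

    RPKM = (CDS read count * 10^9) / (CDS length * total mapped read count)
    """
    if cds_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("CDS length and total mapped reads must be positive")
    return cds_read_count * 1e9 / (cds_length_nt * total_mapped_reads)


def fold_decrease(wildtype_rpkm: float, variant_rpkm: float) -> float:
    """How many times lower the variant-containing expression is than wildtype."""
    if variant_rpkm <= 0:
        raise ValueError("variant RPKM must be positive to form a ratio")
    return wildtype_rpkm / variant_rpkm
```

The factor of 10^9 combines the per-kilobase (10^3) and per-million-reads (10^6) scalings, so a 1,000-nt CDS receiving 100 of 1,000,000 mapped reads has an RPKM of 100.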
These nReps were located upstream and downstream of known genes in 25-28% of the samples including controls: namely nRep1, a ~66 nt sequence "GGCAAUGAAGUCGCAGGGUUGUACUCUGCGUUGAUACCACUGCUUCCCUGUGGUUGUACGUCAAGG" comprising 3 core elements (A-B-C), in which the entire nRep1 segment is repeated (at least in part) up to 12 consecutive times and was up to 300 nt in length; nRep2, which consists of a 25 nt middle subsection of nRep1, "GUACUCUGCGUUGAUACCACUGCUU" (middle core element, B-B), repeated up to eight times; nRep3, also a subsequence of nRep1, "GGCAAUGAAGUCGCAGGGUU" (first core element, A-A), repeated up to nine times and mostly found upstream of the 5' UTRs; and nRep4, a combination of the core elements 2 and 1 (B-A), "GUACUCUGCGUUGAUACCACUGCUUCGGCAAUGAAGUCGCAGGGUU". Overall, different mixtures of these cores are found upstream of the SARS-CoV-2 ORFs, in particular nsp 3. Interestingly, in four cases nRep1 sequences were located in the middle of SARS-CoV-2 ORFs (out of order) or with exogenous genes (mammalian RAB and SOD2), suggesting these exogenous adaptor primers form concatemers and can fuse with both viral and non-viral genes. These may also be artifacts of reverse transcriptase polymerase skipping to novel sequences, partly due to the predicted secondary structure of the repeat regions observed. This was also observed in other coronavirus examples, particularly with ORFs at the 5' end of the genome, including a mouse coronavirus model that manifests viral hepatitis 15, 35 . Consequently, these nReps are filtered from the raw data file in the pipeline. Envelope (E), membrane protein (M) and nucleocapsid (N) are the identified structural proteins (green). Isoforms and expression levels (in FPKM) identified through the StringTie_v2 pipeline are denoted as Ai1-7 and Bi1-3 from Pools A and B, respectively. The unusual isoforms detected through the developed LORE pipeline are denoted as NA(1-7) and NB(1-2) from Pools A and B, respectively. 
Red asterisks show SNV location, blue asterisk is wildtype nucleotide within an isoform in Pool A. An asterisk in brackets (*) represents a subpopulation of reads carrying the wildtype nucleotide at an SNV site, suggesting partial editing. Green arrows marked as J2/3 and J-S1/S2 denote an SNV within 10 nucleotides of the junction sites nsp 2/nsp 3, and spike protein subunit 1 and subunit 2, respectively. Asterisk denotes SNV validated through ddPCR (Fig. 4) . (b) Bar graph representation of SARS-CoV-2 ORF/gene expression relative to total reads among the long-sequenced transcript population for Pool A (blue), Pool B (orange) and ATCC SARS-CoV-2 control (gray). Note the lower expression in genes downstream of S2 for Pool B samples, suggesting an association with the SNV at the spike subunits 1 and 2 junction (J-S1/S2) depicted in (a). Table 1 for primer/probe detail. Reference RNA genome GenBank: MT008022.1 (Wuhan) was used. ddPCR allows for dual-channel probes (FAM and HEX) which were designed to specify each nucleotide (wildtype and variant, respectively). FAM (y-axis) is represented by blue dots and HEX (x-axis) is represented by green dots (droplets) in panels b-d. 
Origin and evolution of pathogenic coronaviruses
Recognizing the asymptomatic enemy
Characteristics of SARS-CoV-2 and COVID-19
Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein
Receptor Recognition by the Novel Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS Coronavirus
Structural basis of receptor recognition by SARS-CoV-2
Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan
The proximal origin of SARS-CoV-2
Continuous and Discontinuous RNA Synthesis in Coronaviruses
Coronavirus biology and replication: implications for SARS-CoV-2
Genetic Variants of SARS-CoV-2 - What Do They Mean? JAMA Publ
Covid-19: What have we learnt about the new variant in the UK?
A flexible and efficient template format for circular consensus sequencing and SNP detection
The Architecture of SARS-CoV-2 Transcriptome
The structure and functions of coronavirus genomic 3' and 5' ends
High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing
Transcriptional fates of human-specific segmental duplications in brain
Regulatory elements in the viral genome
Subgenomic messenger RNAs: Mastering regulation of (+)-strand RNA virus life cycle
SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity
NextStrain: Real-time tracking of pathogen evolution
Patient-Derived Mutations Impact Pathogenicity of SARS-CoV-2. SSRN Electron
Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2
Relationship between RNA splicing and exon editing near intron junctions in wheat mitochondria
APOBEC-mediated editing of viral RNA
Why do RNA viruses recombine?
Comprehensive in-vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms
Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19
Personal omics profiling reveals dynamic molecular and medical phenotypes
Integrative Genome Viewer
Transcriptome assembly from long-read RNA-seq alignments with StringTie2
Discerning novel splice junctions derived from RNA-seq alignment: A deep learning approach
Replication of synthetic defective interfering RNAs derived from coronavirus mouse hepatitis virus-A59

We thank Ting Hon, Jason Underwood, and Elizabeth Tseng of Pacific Biosciences for designing and performing the probe capture-based library protocol on the PacBio sequencing systems. We also would like to thank the Stanford Protein and Nucleic Acid (PAN) facility for probe and primer synthesis.

JLPT and SB performed the experiments. MKS contributed to sample preparation and QC. JLPT conceptualized and designed the bioinformatic analysis tools and wrote the first draft of the manuscript. All of the authors contributed to revising the manuscript.

JLPT, SB, MKS and BAP declare no competing interest. MPS is a founder and member of the science advisory board of Personalis, SensOmics, Qbio, January, Mirvie and Filtricine, and Protos and a science advisory board member of Genapsys and Epinomics.