key: cord-0798694-l3evokd6
authors: Wong, Chee Hong; Ngan, Chew Yee; Goldfeder, Rachel L.; Idol, Jennifer; Kuhlberg, Chris; Maurya, Rahul; Kelly, Kevin; Omerza, Gregory; Renzette, Nicholas; De Abreu, Francine; Li, Lei; Browne, Frederick A.; Liu, Edison T.; Wei, Chia-Lin
title: Subgenomic RNAs as molecular indicators of asymptomatic SARS-CoV-2 infection
date: 2021-02-06
journal: bioRxiv
DOI: 10.1101/2021.02.06.430041
sha: e2c6f8343389f96ae3619e302e12ce80934934a9
doc_id: 798694
cord_uid: l3evokd6

In Coronaviridae such as SARS-CoV-2, subgenomic RNAs (sgRNAs) are replicative intermediates; therefore, their abundance and structures could indicate viral replication activity and severity of host infection. Here, we systematically characterized sgRNA expression and structural variation in 81 clinical specimens collected from symptomatic and asymptomatic individuals, with the goal of assessing viral genomic signatures of disease severity. We demonstrated highly coordinated and consistent expression of sgRNAs in individuals with robust infections that result in symptoms, and found that their expression is significantly repressed in asymptomatic infections, indicating that the ratio of sgRNAs to genomic RNA (sgRNA/gRNA) is highly correlated with the severity of the disease. Using long-read sequencing technologies to characterize full-length sgRNA structures, we also observed widespread deletions in viral RNAs and identified unique sets of deletions found preferentially in symptomatic individuals, many of which are likely to confer changes in SARS-CoV-2 virulence and host responses. Furthermore, based on the sgRNA structures, the frequently occurring structural variants in SARS-CoV-2 genomes serve as a mechanism that further increases SARS-CoV-2 proteome complexity. Taken together, our results show that differential sgRNA expression and structural mutational burden both appear to be correlated with the clinical severity of SARS-CoV-2 infection.
Longitudinally monitoring sgRNA expression and structural diversity could further guide treatment responses, testing strategies, and vaccine development.

COVID-19, which emerged in late 2019, is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). With its high infectivity and mortality rates, particularly in older individuals and those with pre-existing health conditions, COVID-19 has rapidly expanded into a global pandemic. Of great importance in the management of the pandemic is the observation that many infected individuals are asymptomatic, with estimates ranging from 20% to 80% [1-3]. Asymptomatic patients, while having faster viral clearance [4-7], appear to have viral loads similar to those of symptomatic patients [4,5,8-11] and, therefore, can effectively transmit the disease. Since viral load is not a reliable predictor of disease severity, we examined the genomic biology of SARS-CoV-2 infection in primary patient samples for other correlates of clinical severity.

To understand the pathophysiology of COVID-19, major effort has been devoted to fully decoding the SARS-CoV-2 genome and its genetic variation, specifically single nucleotide variants (SNVs) [12-14]. SARS-CoV-2 is a positive-sense, single-stranded RNA virus. Upon infecting host cells, the virus deploys both replication and transcription to produce full-length genomic RNAs of ~30 kb (gRNAs) and a distinct set of "spliced" subgenomic transcripts (sgRNAs). These sgRNAs are transcribed through a "discontinuous transcription" mechanism [15] by which negative-strand RNAs are produced from the 3' end of gRNAs, followed by a template switch from a 6-nucleotide ACGAAC core transcription regulatory sequence (TRS) that are

Here, we systematically characterized the diversity and prevalence of structural deletions and sgRNA expression in primary human tissues from both symptomatic and asymptomatic individuals using a suite of genomic and transcriptomic analyses.
From routine swabs collected for diagnostic purposes, we ascertained sgRNA configurations and found that their abundance, both as individual sgRNA species and collectively as a group, is drastically reduced in asymptomatic infection. Moreover, we identified widespread structural deletions in SARS-CoV-2 genomes, particularly in the regions encoding sgRNAs. Distinct sets of deletions can be consistently and preferentially found in independent SARS-CoV-2 genomes associated with symptomatic and asymptomatic cases, respectively, suggesting their functional significance. To understand the impact of structural variants on viral protein integrity, we analyzed the predicted viral proteomes from full-length viral transcript isoforms. Our results reveal the highly unstable nature of SARS-CoV-2 genomes and the potential utility of sgRNA expression as an indicator of clinical severity.

SARS-CoV-2 gRNAs and sgRNAs share high overall sequence identity. To distinguish sgRNAs from gRNAs, we exploited a feature of discontinuous transcription, namely the junction between the TRS-L and TRS-B regions, which is found exclusively in sgRNAs. We adopted amplicon-based sequencing (amplicon-seq), a method widely used to characterize SARS-CoV-2 genomes [25], to detect sgRNAs and compare their abundance in COVID-19 positive samples from symptomatic and asymptomatic patients. Amplicon-seq is highly sensitive, with a limit of detection (LoD) reported as low as one SARS-CoV-2 copy per microliter using the optimized protocols from the ARTIC network [26,27]. Therefore, it can effectively enrich for SARS-CoV-2 cDNAs from samples spanning a wide range of viral content. In this approach, viral-specific primers are designed across the full-length RNAs, and amplicons specific for SARS-CoV-2 sgRNAs can be PCR amplified in the multiplex PCRs by the 5'-most primer next to the TRS-L sequence as the forward primer and the reverse primers nearest to the TRS-B sequences.
Based on the locations of the primers, we anticipated that amplicons for 6 of the 9 sgRNA species (sgRNA_S, E, M, 6, 7b and N) could be found in the amplicon-seq data (Methods). Following massively parallel sequencing, these subgenomic-specific amplicons can be identified through the junction reads linking TRS-L and TRS-B in the sequencing data and used to determine the relative abundance of sgRNAs (Fig. 1a). From 51 symptomatic and 30 asymptomatic SARS-CoV-2 positive patients (where asymptomatic patients are defined as those who showed none of the key COVID-19 symptoms within 14 days of testing) (Supplementary Table 1), we extracted total RNA from swabs collected for diagnostic RT-PCR from different locations of the respiratory tract, including anterior nasal, oropharyngeal and nasopharyngeal sites, and performed amplicon-seq to generate deep sequencing data for each sample (>200,000 paired reads, >4000-fold genome coverage) (Supplementary Table 2). From the reads aligned to the reference MN908947.3, amplicons corresponding to 6 sgRNAs were detected through split-mapped reads connecting the first 75 nucleotides, which harbor the TRS-L sequence, to their respective TRS-B sites. To evaluate their relative abundance among different samples, we normalized the number of TRS-L associated junction reads against the total number of SARS-CoV-2 reads in each sample. Based on the normalized junction read counts, the levels of sgRNAs were highly variable, ranging from 0 to 230,155 reads per million (RPM). Among COVID-19 positive individuals, sgRNA levels were significantly lower in asymptomatic than in symptomatic samples (median value 3,498 vs. 72,231; two-sided Wilcoxon Rank-Sum Test, p = 4.9 × 10^-12) (Fig. 1b).
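The normalization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `junction_rpm` and the example counts are hypothetical, and the only rule taken from the text is that TRS-L associated junction reads are scaled to one million SARS-CoV-2 reads per sample.

```python
def junction_rpm(junction_reads: int, total_viral_reads: int) -> float:
    """Scale a junction-read count to reads per million (RPM) viral reads."""
    if total_viral_reads <= 0:
        raise ValueError("total_viral_reads must be positive")
    return junction_reads * 1_000_000 / total_viral_reads


# Hypothetical per-sample counts: (TRS-L junction reads, total SARS-CoV-2 reads).
example_counts = {
    "sample_A": (50, 200_000),
    "sample_B": (18_000, 250_000),
}

# Normalized values are comparable across samples of different sequencing depth.
example_rpm = {
    name: junction_rpm(junction, total)
    for name, (junction, total) in example_counts.items()
}
```

Because every sample is scaled by its own total of viral reads, the resulting RPM values can be compared between groups with a rank-based test such as the two-sided Wilcoxon rank-sum test named in the text.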
To ensure that the reduction in sgRNA expression did not result from a potentially lower viral load in the asymptomatic samples, we further compared the expression of sgRNA per viral gRNA (sgRNA/gRNA) in asymptomatic vs. symptomatic infections. Here, the level of gRNA was defined as the number of reads aligned uninterrupted across the first 400 nucleotides, because such reads are found exclusively in viral gRNA molecules. As shown in Fig. 1c, a significantly lower sgRNA/gRNA ratio (19-fold in median value, two-sided Wilcoxon Rank-Sum Test, p = 5.6 × 10^-12) was observed in asymptomatic hosts, suggesting that the lower levels of sgRNAs were independent of virus quantity in these samples. The relative abundance of sgRNAs to gRNAs is also reflected in the read coverage along the first 400 nucleotides (Fig. 1d). The difference in sgRNA/gRNA ratio is apparent from the differential coverage between the first 75 nucleotides (present in both sgRNAs and gRNAs) and nucleotides 76-400 (present only in gRNAs), visualized in the Integrative Genomics Viewer (IGV) (Fig. 1e), which clearly indicates a higher amount of sgRNAs in the symptomatic samples.

To evaluate whether the reduction of sgRNAs occurred selectively in specific sgRNA species or broadly across all sgRNA transcription, we further compared the levels of each sgRNA species detected in symptomatic vs. asymptomatic samples. The expression levels of individual sgRNA species were determined by assigning each TRS-associated junction read to its sgRNA of origin based on its TRS-B site usage. Among the 6 sgRNA-specific amplicons produced in the amplicon-seq, all but one (sgRNA_E) displayed a significant reduction (two-sided Wilcoxon Rank-Sum Tests, p-values 2 × 10^-7 to 9 × 10^-12) (Fig. 1f). Among them, sgRNA_M exhibited the highest degree of decline (6-37 fold).
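The two computations described above, the sgRNA/gRNA ratio and the assignment of junction reads to sgRNA species by TRS-B site usage, can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the `tolerance` parameter, and the site labels and coordinates in the example are all hypothetical placeholders, and real TRS-B coordinates would have to come from the reference annotation.

```python
from collections import Counter


def assign_species(junction_3p_positions, trs_b_sites, tolerance=0):
    """Count junction reads per sgRNA species.

    junction_3p_positions: 3' (body-side) junction coordinate of each read.
    trs_b_sites: mapping of species label -> TRS-B coordinate (user supplied).
    tolerance: allowed distance in bases between junction and site (assumption).
    """
    counts = Counter()
    for pos in junction_3p_positions:
        for name, site in trs_b_sites.items():
            if abs(pos - site) <= tolerance:
                counts[name] += 1
                break
        else:
            # No annotated TRS-B site matched this junction.
            counts["unassigned"] += 1
    return counts


def sg_to_g_ratio(sgrna_junction_reads, grna_reads):
    """Ratio of sgRNA junction reads to uninterrupted gRNA reads.

    Returns None when no gRNA reads were observed, since the ratio is undefined.
    """
    if grna_reads == 0:
        return None
    return sgrna_junction_reads / grna_reads


# Placeholder labels and coordinates, for illustration only.
example_sites = {"species_X": 1000, "species_Y": 5000}
example_counts = assign_species([1000, 1002, 5000, 42], example_sites, tolerance=2)
```

Dividing by the gRNA read count within the same sample is what makes the comparison independent of overall viral quantity, which is the point of the sgRNA/gRNA ratio in the text.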
Collectively, these results indicate a lack of active viral transcription in asymptomatic infection and suggest that the sgRNA to gRNA ratio in host cells reflects the degree of disease severity.

The differential sgRNA abundance detected in COVID-19 positive samples from symptomatic and asymptomatic patients suggests a potential function for sgRNAs in eliciting host responses. To characterize their expression in the infected cells of symptomatic patients, we adopted an unbiased metagenomic RNA-seq approach to survey the types of sgRNAs expressed and quantitatively evaluate their relative abundance in these samples (Supplementary Table 3). In metagenomic RNA-seq analysis, both host and SARS-CoV-2 RNAs are comprehensively revealed by sequencing the extracted total RNA. Using the Centrifuge algorithm [28], we conducted full metagenome profiling and taxonomic classification to assess the relative ratio of human to SARS-CoV-2 reads. Despite their relatively low Ct values (13-19), suggestive of high viral content, the proportion of reads aligned to SARS-CoV-2 was highly variable among these samples, ranging from 0.06% to 78% (Fig. 2a). We next characterized the types and abundance of sgRNAs expressed in these samples. The TRS-L associated RNA-seq reads were assigned to each of the nine distinct sgRNA species based on their spans across the corresponding TRS-B junction sites closest to the annotated transcript initiation sites. The abundance of SARS-CoV-2 sgRNAs showed no correlation with the viral load inferred from the Ct values of RT-qPCR testing (Spearman correlation coefficient = -0.10, p = 0.50) (Fig. 2b), suggesting that the viral nucleic acid shedding measured by RT-qPCR diagnostic assays does not reflect the activity of viral replication in these samples. The relative abundance of different sgRNA species exhibited remarkable consistency both in their expression ranking (Fig.
2c) and in the relative proportion of reads for each sgRNA class (Fig. 2d). Across all samples, sgRNA_N was the most highly expressed and sgRNA_ORF7b was the least abundant. It is worth noting that sgRNA_ORF7b was not previously detected in in vitro infected cell cultures [22]. The low expression of sgRNA_ORF7b could result from imprecise TRS usage in the discontinuous transcription process. Unlike the other sgRNA species, which were mostly transcribed from the annotated TRS sites, 54% of the sgRNA_7b transcripts adopted an alternative TRS-B' site at MN908947.3:27485 (Fig. 2e). These observations suggest that ORF7b expression is subject to high variability and could be dispensable in vivo. When comparing the relative abundance of sgRNAs to those reported from in vitro Vero cell experiments, 7 sgRNAs exhibited significant differences (p-value < 1e-05), with the most striking difference found in sgRNA_Spike (S) (Fig. 2d). In primary human samples, sgRNA_S was expressed at less than 1% of total sgRNAs but accounted for 14% of total expressed sgRNAs in cultured Vero cells. This difference could arise from differences in SARS-CoV-2 transmission and entry between in vitro cell cultures and primary tissues. Expression of sgRNA_ORF10 was not detected, consistent with what has been described in SARS-CoV-2 infected cell cultures [21].

It has been reported that novel deletions in sgRNAs may have an impact on the clinical presentation of SARS-CoV-2 infection [23] and on transmission rate [29,30]. We therefore examined the structural deletions in SARS-CoV-2 RNAs found within symptomatic and asymptomatic individuals. Through split-aligned reads in the amplicon-seq data that were not mediated by the TRS sites, we detected up to 10^4 per million SARS-CoV-2 paired reads harboring TRS-independent junctions of at least 20 nucleotides in each sample.
These deletion events were more prevalent in viral samples from symptomatic hosts (two-sided Wilcoxon Rank-Sum Test, p = 2.3 × 10^-8) (Fig. 3a), potentially because more active viral replication in these hosts results in greater production of structural variants. In total, we detected 8,551 unique deletions in viral RNAs that were supported by ≥ 2 independent reads. While the vast majority were sporadic events occurring in isolated cases, 501 (6%) deletions were consistently observed in >10% of samples, either specifically in symptomatic hosts (n=375), specifically in asymptomatic hosts (n=38), or in both (n=88) (Fig. 3b). It is interesting to note that, in symptomatic cases, these frequent structural deletions were not only more abundant but also significantly larger (median spans 198 vs 46 nucleotides, p = 1.6 × 10^-15), pointing to a potential selective force for different types of viral variants adapted to distinct cohorts of host responses. These deletions were spread across the entire viral genome (Fig. 3c). To investigate whether distinct sets of deletions in viral RNAs are selected in hosts that differ in disease severity, we examined their relative abundance (defined by normalized counts of read support) and frequencies (defined by the proportions of symptomatic vs asymptomatic samples in which they were found). We identified 296 deletions significantly enriched in symptomatic infections and 10 deletions significantly enriched in asymptomatic infections (p-value <0.05) (Supplementary Table 4). Among them, 263 and 9 deletions were found exclusively in symptomatic and asymptomatic specimens, respectively. We were particularly interested in the 10 deletions preferentially found in the asymptomatic hosts

The widespread and abundant deletions arising in symptomatic infections prompted us to investigate their diversity and impact on viral sgRNA transcription. The observed viral variants presumably resulted from deletions occurring either during viral replication or during transcription (Fig. 4a).
To distinguish their origins and characterize their impact on the translated viral protein products, we examined these deletions in the context of their associated sgRNA structures by full-length (FL) Iso-seq sequencing [33]. From the 10 samples with the highest proportion of SARS-CoV-2 content, we generated a total of over two million high-quality FL cDNA sequences (Supplementary Table 5). Of these, 632,207 (31%) were of SARS-CoV-2 origin and were further clustered into 15,244 distinct transcript units (TUs) supported by ≥ 2 FL cDNA sequences (Fig. 4b). Based on their alignments across TRS-L and their respective canonical TRS-B junction sites, 1,114 FL TUs could be unambiguously assigned to sgRNA origins (Fig. 4c), while 4,591 FL TUs aligned uninterrupted across the TRS-B site and were determined to be products of viral gRNAs (Fig. 4b). When we examined the presence of deletions in these FL TUs, the vast majority of the deletions were independently detected in both the sgRNA-derived and gRNA-derived FL TUs. Their validity was further supported by the breakpoints inferred from the split reads in the metatranscriptome RNA-seq data, suggesting that these were bona fide deletions that occurred during viral gRNA replication as a result of the low fidelity of RNA polymerases. These structural variants were subsequently propagated into protein-coding sgRNAs via transcription. Taking a TU of sgRNA_ORF3a as an example, this TU comprised 4 distinct deletions of 31, 34, 36 and 1,371 nucleotides, respectively, which were independently uncovered by short-read RNA-seq data (Fig. 5a). The same deletions can also be found in multiple TUs encoding distinct sgRNAs, including sgRNA_E, _M and _ORF6 (Fig. 5b). Overall, of the total of 15,244 FL TUs, 3,537 (23%) harbored at least one insertion or deletion of ≥20 bases, which raises the possibility that a substantial population of the SARS-CoV-2 virus carries structural variations during active infection.
Therefore, structural variations of SARS-CoV-2 often lead to alternative sgRNA transcripts and significant alterations in their translation products. These variants potentially exist as quasispecies to facilitate evolutionary selection and host adaptation, as observed in other RNA viral species [34-37]. By placing the co-occurring insertions and deletions onto the individual FL transcripts, we can investigate the precise impact of these variants on viral protein translation. From the collection of the 1,114 sgRNA-derived FL cDNA sequences, 23% of these transcripts carried frameshifts with >35 aa predicted translated protein products of truncations (20.1%), extension

Our work also showed distinct and recurring sets of viral RNA deletions in both symptomatic and asymptomatic infections. Their consistent and preferential detection in multiple COVID-19 positive cases points to genome instability as a source of viral proteome complexity and potential evolutionary selection for host adaptation. Taken together, when combined with host genetics and immune response, sgRNA expression and structural diversity could provide insight into host-viral interactions, evolution and transmission. This, in turn, will guide risk mitigation and testing strategies, and inform future vaccine development.

Samples for clinical diagnosis were collected by a combination of nasal, oral, nasopharyngeal and oropharyngeal swabs between April and September 2020. Patient age ranged

Short-read RNA sequencing and data processing

RNA-seq libraries were prepared with the KAPA mRNA HyperPrep Kit (Roche) according to the manufacturer's instructions. First, poly-A+ RNA was isolated from 1 µl of total RNA extracted from clinical samples using oligo-dT magnetic beads. Purified RNA was then fragmented at 85°C for 6 min, targeting a fragment range of 250-300 bp.
Fragmented RNA was reverse transcribed with an incubation at 25°C for 10 min and 42°C for 15 min, followed by an inactivation step at 70°C for 15 min. This was followed by second-strand synthesis and A-tailing at 16°C for 30 min and 62°C for 10 min. A-tailed, double-stranded cDNA fragments were ligated with Illumina-compatible adaptors containing a Unique Molecular Identifier (UMI) (IDT). Adaptor-ligated DNA was purified using AMPure XP beads (Beckman Coulter). This was followed by 17 cycles of PCR amplification. The final library was cleaned up using AMPure XP beads. Libraries were quantified using real-time qPCR (Thermo Fisher). Sequencing was performed on an Illumina NovaSeq in paired-end mode with 149 bases, indexes and 9 bases of UMI.

Raw paired-end reads were trimmed, classified by potential source, and mapped as documented above (Amplicon data processing). Read deduplication was performed with UMI-tools (v1.0.1) [49]. The CIGAR strings of the aligned paired-end reads were parsed for jumps and deletions (represented by CIGAR operations N or D of size ≥20 bases). Only samples with ≥100 UMI-deduplicated split-aligned read pairs were considered (n=45). The sgRNA abundance was normalized between samples by a scale factor of 1,000,000 / total number of UMI-deduplicated mapped read pairs, giving a comparable unit of measure, (junction-)read pairs per million (RPM). The sample viral load was calculated by transforming the Ct value as 2 to the power of (27 - Ct). The value 27 was chosen so that the calculated values are comparable to the numbers of junction reads per million reads.

We followed the definition of ref. 21 for sgRNA read classification, with a modification. We still required that the split-read junction mark the leader-to-body junction and that the protein product translated from the concatenated sequence correspond to the canonical sgRNA. However, we required that the 5' site of the split-read deletion map to a genomic position between 59 and 79 (TRS-L: 70-75 nt), instead of between 55 and 85 [21].
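The read-level rules in this section (gaps as CIGAR N or D operations of ≥20 bases, a leader junction whose 5' site falls between positions 59 and 79, and viral load as 2 to the power of (27 - Ct)) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the 5' site is taken here to be the last aligned base before the gap, and the 59-79 bounds are treated as inclusive.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = frozenset("MDN=X")


def find_junctions(ref_start, cigar, min_gap=20):
    """Return (five_prime, three_prime) pairs for each N or D gap >= min_gap.

    ref_start is the 1-based leftmost aligned reference position.
    five_prime is the last aligned base before the gap and three_prime is the
    first aligned base after it (an assumed convention).
    """
    junctions = []
    pos = ref_start  # next reference base to be consumed
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in ("N", "D") and length >= min_gap:
            junctions.append((pos - 1, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return junctions


def is_leader_junction(five_prime, lower=59, upper=79):
    """True if the 5' site of the gap lies in the leader window (inclusive)."""
    return lower <= five_prime <= upper


def viral_load_from_ct(ct, offset=27):
    """Transform an RT-qPCR Ct value into a relative viral load, 2^(offset - Ct)."""
    return 2.0 ** (offset - ct)
```

For example, a read starting at position 1 with CIGAR `70M25300N80M` yields a single junction whose 5' site is position 70, which falls inside the leader window, while a 5-base deletion is ignored because it is shorter than the 20-base threshold.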
The 59-79 window was established based on the sequence identity between the leader and body regions. For a gRNA read count comparable to the sgRNA read counts, we required that the read harbor no junction and overlap genomic positions 1 to 85, and that its mate read map within the first 1000 bases of the genome. The relative abundance of a sample's sgRNA is thus the sgRNA read count divided by the sum of the sample's gRNA and all sgRNA read counts.

DNBseq RNA sequencing data of SARS-CoV-2-infected Vero cells [21] were downloaded. The data were processed, and expression was computed, exactly as for our short-read RNA sequencing data.

Total RNA extracted from nasopharyngeal swabs was prepared according to the Iso-seq Express Template Preparation protocol (PacBio). Full-length cDNA was generated using the NEBNext Single Cell/Low Input cDNA Synthesis and Amplification Module in combination with the Iso-seq Express Oligo Kit. Amplified cDNA was purified using ProNex beads. For samples with a yield lower than 160 ng, additional PCR cycles were added. cDNA samples yielding 160-500 ng then underwent SMRTbell library preparation, including DNA damage repair, end repair and A-tailing, and were finally ligated with Overhang Barcoded Adaptors. Libraries were then pooled and sequenced on a PacBio Sequel II.

The raw sequencing data were processed with the SMRT Link (v 8.0.0.80529) Iso-Seq analysis pipeline with the default parameters. First, circular consensus sequences (CCSs) were generated from the raw sequencing reads. CCSs, demultiplexed based on the sample barcodes in the adaptors, were further classified into full-length, non-chimeric (FLNC) CCSs and non-full-length, non-chimeric CCSs based on the presence of chimeric sequence, sequencing primer and 3' terminal poly-A sequence. FLNC CCSs (which contain both the 5' and 3' adaptor sequences along with the poly-A tail) were clustered to generate isoforms.
Only the high-quality (accuracy ≥0.99) transcript isoforms (referred to here as TUs) were aligned to the SARS-CoV-2 genome reference (MN908947.3) with pbmm2 (v1.1.0). The CIGAR string of each aligned TU was parsed for gaps (represented by CIGAR operations N or D of size ≥20 bases). The identified gaps were clustered based on their aligned genomic coordinates, with a maximum difference of 10 bases among the gap start (and end) coordinates of the members of a cluster. A TU with multiple transcribed segments whose first segment's 3' site mapped to genomic positions 59-79 was considered TRS-L mediated. The translation products of the TUs were predicted by translating the sequence with the standard genetic code from the first AUG (methionine) encountered. The translation products were annotated against the Conserved Domain Database (CDD), which includes 55,570 position-specific score matrices (PSSMs) [40].

All data described in this study have been deposited in NCBI's Sequence Read Archive under PRJNA690577 (https://www.ncbi.nlm.nih.gov/bioproject/690577).

CLW, CHW and CYN are co-inventors on a patent application submitted by The Jackson Laboratory entitled "Subgenomic RNAs for Evaluating Viral Infection". The other authors declare no conflict of interest.

Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis Estimating the extent of asymptomatic COVID-19 and its potential for community transmission: Systematic review and meta-analysis COVID-19: in the footsteps of Ernest Shackleton Early viral clearance and antibody kinetics of COVID-19 among asymptomatic carriers. 
medRxiv The natural history and transmission potential of asymptomatic SARS-CoV-2 infection Clinical characteristics of 24 asymptomatic infections with COVID-19 screened among close contacts in Nanjing Comparison of Clinical Characteristics of Patients with Asymptomatic vs Symptomatic Coronavirus Disease SARS-CoV-2 Infections Among Children in the Biospecimens from Respiratory Virus-Exposed Kids (BRAVE Kids) Study. medRxiv What the data say about asymptomatic COVID infections Suppression of a SARS-CoV-2 outbreak in the Italian municipality of Vo' Presymptomatic SARS-CoV-2 Infections and Transmission in a Skilled Nursing Facility Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region Nextstrain: real-time tracking of pathogen evolution Continuous and Discontinuous RNA Synthesis in Coronaviruses Origin and evolution of pathogenic coronaviruses Virological assessment of hospitalized patients with COVID-2019 The group-specific murine coronavirus genes are not essential, but their deletion, by reverse genetics, is attenuating in the natural host Severe acute respiratory syndrome coronavirus group-specific open reading frames encode nonessential functions for replication in cell cultures and mice Characterisation of the transcriptome and proteome of SARS-CoV-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein The Architecture of SARS-CoV-2 Transcriptome Pervasive generation of non-canonical subgenomic RNAs by SARS-CoV-2 Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore A rapid, cost-effective tailed amplicon method for sequencing SARS-CoV-2 The COVID-19 XPRIZE and the need for scalable, fast, and widespread testing Centrifuge: rapid and sensitive classification of metagenomic sequences The ORF3a 
protein of SARS-CoV-2 induces apoptosis in cells Viruses and apoptosis Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing Cooperation between distinct viral variants promotes growth of H3N2 influenza in cell culture Viral quasispecies evolution SARS-CoV-2 Quasispecies Mediate Rapid Virus Evolution and Adaptation. bioRxiv Evolution of viral quasispecies during SARS-CoV-2 infection Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Variant analysis of SARS-CoV-2 genomes CDD/SPARCLE: the conserved domain database in 2020 Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses The coronavirus nucleocapsid is a multifunctional protein Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based on SARS-CoV Immunological Studies SARS-CoV-2 SPIKE PROTEIN: an optimal immunological target for vaccines SARS-CoV-2 Virus Culture and Subgenomic RNA for Respiratory Specimens from Patients with Mild Coronavirus Disease Cutadapt removes adapter sequences from high-throughput sequencing reads STAR: ultrafast universal RNA-seq aligner UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

Subgenomic RNA (sgRNA)-specific amplicons can be identified by specific amplification with the 5' forward primer closest to the transcription regulatory sequence leader (TRS-L) site and the 3' reverse primers closest to the TRS-Body (TRS-B) sites, followed by sequencing and alignment. Distribution of sgRNA normalized counts (b; p = 4.9 × 10^-12) and sgRNA to genomic RNA (gRNA) ratio (c; p = 5.6 × 10^-12) between symptomatic and asymptomatic cases. d. Sequencing coverage across the 5' 400 nucleotides of the SARS-CoV-2 genome shows the contribution from sgRNAs and gRNA, respectively. e. 
Amplicon-seq coverage across the 5' 400 nucleotides from representative symptomatic and asymptomatic cases. f. Distribution of the normalized counts of individual sgRNA species measured in symptomatic and asymptomatic cases.

The authors thank all Jackson Laboratory Clinical Laboratory team members for their effort in the collection and processing of COVID-19 samples, and the Jackson Laboratory Genome Technologies team members for their sequencing effort. We also thank Linda Choquette and her

Figure 2: Expression of subgenomic RNAs (sgRNAs) in the clinical specimens from symptomatic patients. a. The percentages of SARS-CoV-2 (blue) and human (red) reads detected in each of the symptomatic samples (n=51). b. Correlation analysis between viral load (RT-qPCR Ct values) and sgRNA abundance (numbers of junction reads per million). c. Transcription regulatory sequence (TRS) usage. Percentages of sgRNA-derived junction reads split at their corresponding known TRS-Leader (TRS-L) and TRS-Body (TRS-B) sites for each sgRNA species and the relative abundance ranking. d. Proportions of reads assigned to genomic RNA (gRNA) and each sgRNA species in symptomatic samples (n=45) and Vero cultured cells (n=1). Center line, median; boxes, first and third quartiles; whiskers, 1.5 × the interquartile range; points, outliers. e. Sequences at the alternative TRS-B sites used by sgRNA_ORF7b transcription.

Distributions of normalized split-aligned read counts in asymptomatic and symptomatic patients. Two-sided Wilcoxon Rank-Sum Test, p = 2.3 × 10^-8. Center line, median; boxes, first and third quartiles; whiskers, 1.5 × the interquartile range. b. Deletions inferred from amplicon-seq data from asymptomatic and symptomatic patients' specimens. c. Visualization of the deletions detected in symptomatic (n=287), asymptomatic (n=34) and both (n=79) samples in the IGV genome browser, in reference to annotated subgenomic RNA (sgRNA) transcribed regions. d. 
Top: Deletions (n=10) preferentially found in viral RNAs from the asymptomatic samples. Middle: zoom-in view in sgRNA_ORF3a coding sequence (CDS) region shows the two deletions uniquely found in asymptomatic cases, their normalized counts and representative read supports. Lower: their predicted translated peptide in reference to the wildtype ORF3a peptide.
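
The predicted peptides referred to in this caption follow the rule stated in the Methods: translate with the standard genetic code starting at the first AUG encountered. A minimal sketch of that rule is given below. It is an illustration only, not the authors' code; the function name is hypothetical, and stopping at the first in-frame stop codon and marking unrecognized codons as "X" are assumptions not stated in the text.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_start(sequence):
    """Translate from the first ATG/AUG until a stop codon or the sequence end.

    Returns an empty string when the sequence contains no start codon.
    Codons with unrecognized characters are reported as "X" (assumption).
    """
    seq = sequence.upper().replace("U", "T")
    start = seq.find("ATG")
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break  # assumed: translation ends at the first in-frame stop
        peptide.append(amino_acid)
    return "".join(peptide)
```

Applying this function to a transcript sequence with and without a given deletion is one way to see whether the deletion truncates, extends, or frameshifts the predicted product relative to the wildtype peptide.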