key: cord-0805496-noqbcxua authors: Wu, Kevin E.; Fazal, Furqan M.; Parker, Kevin R.; Zou, James; Chang, Howard Y. title: RNA-GPS Predicts SARS-CoV-2 RNA Residency to Host Mitochondria and Nucleolus date: 2020-06-20 journal: Cell Syst DOI: 10.1016/j.cels.2020.06.008 sha: ec13aa702dae87b8f4fd10fa62f18201b54b9e40 doc_id: 805496 cord_uid: noqbcxua

Abstract/Summary SARS-CoV-2 genomic and subgenomic RNA (sgRNA) transcripts hijack the host cell's machinery. Subcellular localization of its viral RNA could thus play important roles in viral replication and host antiviral immune response. We perform computational modeling of SARS-CoV-2 viral RNA subcellular residency across eight subcellular neighborhoods. We compare hundreds of SARS-CoV-2 genomes to the human transcriptome and other coronaviruses. We predict the SARS-CoV-2 RNA genome and sgRNAs to be enriched towards the host mitochondrial matrix and nucleolus, and that the 5’ and 3’ viral untranslated regions contain the strongest, most distinct localization signals. We interpret the mitochondrial residency signal as an indicator of intracellular RNA trafficking with respect to double-membrane vesicles, a critical stage in the coronavirus life cycle. Our computational analysis serves as a hypothesis generation tool to suggest models for SARS-CoV-2 biology and inform experimental efforts to combat the virus. A record of this paper’s Transparent Peer Review process is included in the Supplemental Information.

COVID-19 (coronavirus disease 2019) has become a global pandemic, fueled by the rapid spread of the coronavirus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), a positive strand RNA virus (Wu et al., 2020a; Sanche et al., 2020). The scientific community is actively trying to understand SARS-CoV-2's biological mechanisms and effects. Here, we computationally analyze the subcellular localization patterns of SARS-CoV-2 RNA transcripts. 
Our results suggest potential avenues for experimental validation and follow-up, while providing a template for in silico analyses of viral RNA. RNA subcellular localization is critical to a myriad of cellular processes (Ryder and Lerit, 2018; Chin and Lécuyer, 2017; Buxbaum et al., 2015). Researchers have also discovered that RNA localization plays a significant role in the life cycle of viruses, with functions ranging from regulating sites of virion assembly (Becker and Sherer, 2017) to disrupting host mitochondrial function (Somasundaran et al., 1994). However, the subcellular localization of SARS-CoV-2 RNA, and that of other coronaviruses, is largely unexplored. Gaining a better understanding of the behavior and localization of SARS-CoV-2's RNA genome and transcripts can lead to a better understanding of its function and pathogenicity, potentially revealing targetable mechanisms. To computationally study this aspect of SARS-CoV-2 biology, we built upon our recent work developing RNA-GPS, a state-of-the-art computational model predicting high-resolution RNA localization in human cells (Wu et al., 2020b). RNA-GPS was trained on transcriptome-wide localization patterns of human RNAs across eight subcellular landmarks (Fazal et al., 2019). RNA-GPS's strong performance, coupled with viruses' dependence on hijacking and repurposing existing cell machinery for reproduction, suggests that RNA-GPS could provide insights into SARS-CoV-2's localization behavior and focus future experimental efforts. We use RNA-GPS to interrogate the dominant subcellular residency patterns of SARS-CoV-2's genome, which spans approximately 30 kilobases of single-stranded positive-sense RNA (Kim et al., 2020) (Figure 1A). RNA-GPS predicts that SARS-CoV-2 and the transcripts it forms have enriched residency at the nucleolus and the mitochondria. 
We note that our analysis may suggest potential localization mechanisms for SARS-CoV-2, rather than direct physical localization, particularly with regard to our mitochondrial prediction. Comparison of SARS-CoV-2's predicted residency with that of other human coronaviruses, including strains causing the common cold, Middle East respiratory syndrome (MERS), and the SARS outbreak of 2003, shows that SARS-CoV-2 exhibits a stronger mitochondrial and nuclear residency signal than a large majority of its coronavirus relatives. We additionally find that this residency signal appears to be driven by the 5' and 3' ends of the viral genome. We conclude by connecting our predictions to known RNA and viral biology and proposing possible explanatory mechanisms for previously observed phenomena. Our findings entreat experimental validation and serve as a framework for applying machine learning for principled hypothesis generation in viral biology. We leverage our recent work developing RNA-GPS, a computational model predicting high-resolution RNA subcellular localization in human cells (Wu et al., 2020b) . Our model was built using APEX-seq data, which fuses the engineered peroxidase APEX2 protein to various protein localization sequences ( Figure 1B , Table S1 ) to guide APEX2 to each subcellular region for subsequent proximity biotinylation of nearby RNAs (Fazal et al., 2019) . The resultant transcripts captured and measured at the nucleolus, for example, are transcripts proximal to APEX2 in the nucleolus, as well as transcripts proximal to APEX2 throughout its entire lifecycle including its transport to the nucleolus. Such "en route" transcripts constitute a small proportion of total transcripts, except in the notable case of the mitochondrial matrix COX4 marker, which picks up a sizable proportion of nuclear-encoded transcripts as it is imported to the mitochondria ( Figure S1A ). 
Though this is surprising, these nuclear-encoded mitochondrial-enriched transcripts are reproducibly distinct from noise (Figure S1B, S1C) and actually enrich for cytoskeletal and intracellular transport processes ( Figure S1D ). For the sake of brevity, we will refer to these measurements using their final destinations, as confirmed by imaging: the cytosol, endoplasmic reticulum, mitochondrial matrix, outer mitochondrial membrane, nucleus, nucleolus, nuclear lamina, and nuclear pore. RNA-GPS predicts localization to each of these eight neighborhoods ( Figure 1B ). Although RNA-GPS is trained on human, not viral, RNA transcripts, its ability to generalize across cell types not used in training, combined with the fact that viruses commandeer human cellular machinery suggest that it offers a reasonable hypothesis of viral transcript localization behavior given currently available data. Nonetheless, there is inherent uncertainty associated with generalizing our model across species, and we use the term dominant subcellular residency to indicate this predictive uncertainty where appropriate. We consider viral transcript subcellular residency predictions to each compartment averaged across all released and annotated SARS-CoV-2 genomes available as of April 6, 2020 (n = 213) on GenBank (Coordinators, 2018) . SARS-CoV-2 is believed to enter the cell as a positive strand genomic RNA, subsequently forming 11 positive strand sub-genomic RNA (sgRNA) transcripts encoding different open reading frames and sharing the same 5' leader sequence and 3' untranslated region (UTR) ( Figure 1A ). Within each viral genome, we predict the residency of each sgRNA produced from the primary SARS-CoV-2 genome. To better understand how strong these predicted residency probabilities are in a meaningful biological context, we frame them relative to predictions for other relevant baseline transcript sequences. 
We consider two such baselines: the distribution of model predictions on transcripts exhibiting significant localization within the human HEK293T cell line (n = 366 transcripts) as measured by APEX-seq (Fazal et al., 2019) , and the distribution of model predictions on transcripts derived from human coronaviruses, excluding SARS-CoV-2 (n = 191 genomes, spanning diseases from the common cold to MERS, Table S2 ). The human baseline quantifies the strength of RNA residency signals in SARS-CoV-2 relative to naturally occurring human transcripts with well-characterized localization behaviors. The coronavirus baseline focuses on differences in the transcript residency behavior of SARS-CoV-2 relative to similar viral specimens -differences that may help researchers focus on the peculiarities of this virus. For both baselines, we calculate the proportion of the baseline distribution that the SARS-CoV-2 localization prediction exceeds, which we refer to as a rank score. For example, a residency rank score of 0.6 for the nucleolus relative to human transcripts suggests that the particular viral RNA is more likely to have been picked up by the nucleolus APEX-seq marker compared to 60% of human RNAs that are empirically measured to do so. We find that compared to transcripts with known localizations in human cells, SARS-CoV-2 has a notable residency signal towards the nucleolus, as well as the mitochondrial matrix ( Figure 1C ). These residency signals are consistent across different sgRNAs encoded by the virus (shown in each row, Figure 1C ), and represent statistically significant predicted residency (Table S3 ). The nucleolus is known to play a prominent role in the viral life cycle, even for viruses that primarily replicate in the cytoplasm as SARS-CoV-2 presumably does (Salvetti and Greco, 2014) . 
While some RNA viruses like human immunodeficiency virus (HIV) exhibit transcript localization to mitochondria (Somasundaran et al., 1994) , there has not been direct evidence that SARS-CoV-2 does this. As previously discussed, since much of the APEX-seq mitochondrial data used to train RNA-GPS actually consists of nuclear-encoded transcripts likely picked up as the APEX-COX4 fusion protein is transported to the mitochondria, we hypothesize that our predicted mitochondrial residency is alluding to similarity in localization pathways, rather than localization destination. In addition to framing our localization results in the context of endogenous human transcripts, we also compare predicted residency of SARS-CoV-2 sgRNAs to that of other human coronaviruses ( Figure 1D ). Here, we observe similar overall trends in our residency predictions. Consistent with the comparison to human transcripts, we find the SARS-CoV-2 mitochondrial matrix residency signal is stronger than that of many other coronaviruses. Additionally, we see an overall pattern suggesting that SARS-CoV-2 may have a greater affinity for nuclear neighborhoods (nuclear pore, nucleus, nucleolus, and nuclear lamina) compared to other coronaviruses. We also compared the dominant subcellular residency patterns of the coronavirus family (excluding SARS-CoV-2) with human transcripts using RNA-GPS. We found that the most prominent residency signals for general human coronaviruses pointed towards the nucleolus, mitochondrial matrix, and ER membrane ( Figure S2 ). Overall, our computational analysis suggests that SARS-CoV-2's predicted sgRNA transcript residency enriching for the mitochondrial matrix and nucleolus may be amplifications of behaviors that were already present in coronaviruses. While direct experimental data measuring coronavirus sgRNA transcript localization is not currently available, we sought to validate our predictions on other human viruses with known subcellular localizations. 
After conducting a systematic literature search, we found one such example: the human cytomegalovirus β2.7 mRNA transcript, which localizes to the inner mitochondrial membrane (Williamson et al., 2012) and is approximately 2.5 kilobases long. RNA-GPS predicts this transcript to reside at the mitochondrial matrix with a rank score of 0.81; no other compartments have a rank score exceeding 0.5 ( Figure 2A ). Thus, the algorithm's residency prediction is in close agreement with experimental evidence for β2.7 mRNA localization. While large-scale comparisons are not currently feasible due to lack of datasets measuring viral transcript localization, this example provides some reassurance that RNA-GPS' predicted viral residencies are reasonable. To further validate the robustness of these results, we also trained a different predictive algorithm (a recurrent neural network, see STAR Methods for additional details) on the APEX-seq data and performed a similar set of experiments, comparing SARS-CoV-2 dominant subcellular residency predictions to human and coronavirus baselines ( Figure S3A /B). This alternative model also predicts strong mitochondrial matrix and nucleolus residency for SARS-CoV-2. Since this algorithm uses a very different modeling strategy from RNA-GPS and nonetheless converges to similar findings, this suggests that the mitochondrial matrix and nucleolus residency predictions are not artifacts of a particular computational modelling strategy and increases our confidence in our findings. In addition to evaluating robustness of our results to modelling strategies, we also evaluated robustness with respect to the APEX-seq data used to train the models. As we previously mentioned, many APEX-seq transcripts used to train RNA-GPS's mitochondrial predictions are actually nuclear-encoded. These transcripts exhibited relatively low (albeit significant) enrichment compared to transcripts natively encoded in the mitochondrial genome. 
To ensure our results have not been driven by potentially noisy data, we excluded nuclear-encoded, "non-canonical" mitochondrial matrix transcripts with relatively low APEX-seq enrichment signal (lowest 20% of log fold change enrichment scores), and retrained RNA-GPS on this adjusted dataset. This denoised model recapitulates the same SARS-CoV-2 residency towards the mitochondrial matrix and nucleolus ( Figure 2B ), suggesting that our predictions are robust to noise in the training data. In summary, our predicted residencies are robust across different modelling strategies, and across variation in the data used to train these models. SARS-CoV-2 negative strand RNA also shows residency to mitochondria and nucleolus During their replication life cycle, coronaviruses like SARS-CoV-2 copy their positive strand RNA to create a negative strand RNA that serves as the template for viral "transcription" and production of sgRNAs (Wu and Brian, 2010) . We applied RNA-GPS to the negative strand SARS-CoV-2 sgRNA precursors and discovered that they also exhibit residency to the mitochondrial matrix and nucleolus ( Figure S4 ). This result suggests that the sequence features driving these residency patterns are independently present in both positive and negative strand RNAs, further boosting the localization capability of SARS-CoV-2 during different stages of its viral cycle. In addition to predicting residency, our computational model can also help understand which regions of the transcript may be more responsible for driving these predictions. At a high level, this can be done by evaluating which features were most important for RNA-GPS's predictions. We specifically investigated the potential contribution of the three main regions of the SARS-CoV-2 sgRNAs: the shared 5' leader sequence, the shared 3' UTR, and the variable "coding" sequence in the middle. We predicted residency for each of these regions by itself ( Figure 1E ). 
The 5' leader sequence shows the strongest residency signal for the mitochondrial matrix, and relatively low signal for the nucleolus. In contrast, the 3' UTR has the strongest residency for the nucleolus and also has a strong signal for the mitochondrial matrix. The coding sequence (CDS) also shows specific signals for these two compartments. As the 5' and 3' sequences are shared by the different SARS-CoV-2 sgRNAs, this is likely a strong factor behind the consistent residency patterns we predict across the different sgRNAs. We also performed further computational ablation studies of RNA binding protein (RBP) motifs in SARS-CoV-2. However, computational deletions of all instances of each individual RBP motif, repeated across all enriched RBPs, did not significantly alter the RNA-GPS residency predictions. This result suggests that the SARS-CoV-2 residency signal could be abundant in the viral genome and may involve complex interactions not captured by relatively short single RBP binding motifs. In this work, we apply computational models of human RNA transcript localization to better understand the subcellular localization behavior of the SARS-CoV-2 genome and its constituent sgRNAs. This approach builds upon the idea that the virus uses existing human cell machinery to reproduce, and consequently that sequence-based localization signals are likely shared between human and coronavirus transcripts. The strengths of this approach include (1) the potential to understand viral RNA localization without the risk of live viral cultures; (2) the ability to examine hundreds of viral isolates and related coronaviruses and thousands of RBP motif ablations; (3) the ability to examine viral genes, UTRs, and negative strands individually, which may otherwise require the ability to precisely synchronize and arrest the viral life cycle. 
We find that SARS-CoV-2 appears to harbor strong transcript residency signals towards the mitochondrial matrix and nuclear compartments, often comparable to human RNAs and more so than other coronaviruses. This intriguing hypothesis invites future experimental exploration and validation. As we mentioned previously, we believe that our predicted mitochondrial residency signal is more indicative of a localization pathway than a destination; in the context of coronavirus biology, this may specifically be related to double membrane vesicles (DMVs). Coronaviruses are known to produce DMVs to serve functions like concealing the virus from cellular defenses (Hagemeijer et al., 2012; Knoops et al., 2008). While these DMVs are generally believed to be formed via viruses manipulating the ER membrane (Blanchard and Roingeard, 2015), the mechanism for importing and packaging proteins and RNA into these miniature organelles is not as clearly understood. One possible mechanism for importing viral RNA involves the virus exploiting RNA localization mechanisms that the cell already possesses for endogenous double-membrane organelles: namely, the mitochondria. Indeed, introducing just two amino acid point mutations in the murine coronavirus causes both a significant drop in the number of DMV structures observed and a sharp increase in viral protein localization at the mitochondria (Clementz et al., 2008). This alludes to a high degree of resemblance between DMV and mitochondrial localization mechanisms, leading to our hypothesis that our mitochondrial matrix residency predictions are capturing this similarity between the DMV and mitochondria. Furthermore, DMVs have been shown to contain double-stranded RNA (Hagemeijer et al., 2012); our strand-agnostic residency predictions are concordant with this evidence and might even encourage formation of such complexes. 
Under this model, SARS-CoV-2's strong mitochondrial residency signal relative to other coronaviruses may even contribute to its similarly high infectivity by increasing its efficacy in forming these DMV structures. Another possible interpretation of these predicted residencies is that previously studied viral protein localizations are influenced by transcript-level localizations, a mechanism that is highly prevalent for proteins in normal human cells (Blower, 2013) . Protein-protein interaction studies performed on SARS-CoV-2 have found that its NSP5 (within ORF1a), NSP13 (within ORF1b), ORF6, and ORF10 proteins interact with host proteins that predominantly localize to nuclear compartments (Gordon et al., 2020). The same study found that the ORF9b protein, produced by the "N" sgRNA, interacts with TOMM70, a mitochondrial import receptor that plays a critical role in modulating interferon response -a key antiviral cellular defense pathway (Liu et al., 2010) . In both cases, localized viral transcripts could help drive viral protein localization, enabling more focused protein-protein interactions. A limitation of our work lies in that it applies models trained on human RNA transcript localization data to viral transcripts. It is possible that SARS-CoV-2 infection could alter the host subcellular structures and RNA transport machinery so drastically that our learned localization patterns from human cells no longer hold. If RNA-GPS's predictions turn out to be wrong for this reason, this might suggest that coronavirus infection devastates host cell RNA trafficking and localization -a previously unrecognized feature of COVID-19 pathobiology. After all, the vast majority of RNA binding proteins in the host cell, which are key drivers of transcript localization, recognize and process RNAs irrespective of whether they are endogenous or foreign, and inability to "properly" localize viral RNAs should mirror a similar breakdown for host cell transcripts. 
As we are unable to use existing experimental evidence to thoroughly evaluate and cross-reference the predictions discussed here, future experiments in this vein are clearly necessary. Given the historical scarcity of studies focusing on viral transcript localization, such experiments would likely reveal interesting, crucial insights into viral pathobiology, whether they confirm our specific mitochondrial and nucleolus predictions or not. It is worth pointing out, though, that this is but one of many complex, interconnected viral mechanisms at play. In summary, we build upon recent computational models of RNA subcellular localization to study, in silico, the localization properties of SARS-CoV-2 transcripts. Our results suggest that predicted transcript residency signals, specifically towards the nucleolus and mitochondrial matrix, may be important, unique characteristics of SARS-CoV-2 that warrant additional study. We connect these observations to known viral biology regarding DMV structures in viral replication, as well as SARS-CoV-2 protein localization patterns. In doing so, we propose potential cellular mechanisms that underpin viral biology, mechanisms that warrant experiments validating their accuracy, and perhaps even their potential as therapeutic targets. More broadly, we hope that our study helps define a framework for applying machine learning models to enable focused hypothesis generation, enabling similar studies that leverage data science to rapidly respond to emerging epidemiological challenges.

In the interest of transparency, the following changes were made in this paper during review. We used "RNA subcellular residency" rather than "RNA localization" to describe RNA-GPS prediction results, as this is more reflective of the underlying APEX-seq training data and its inherent differences compared to viral transcripts. 
We clarified the origin, specificity, and interpretation of nuclear-encoded transcripts enriched by the COX4-APEX2 mitochondrial matrix landmark, thus enhancing our interpretation of this predicted localization. We added a positive control showing a CMV mRNA with known mitochondrial localization is correctly predicted by RNA-GPS. We thank the reviewers and editor for insightful comments that have improved this work. For context, the complete Transparent Peer Review Record is included within the Supplemental Information. U01MH098953 and grants from the Silicon Valley Foundation and the Chan-Zuckerberg Initiative. F.M.F. is supported by an NIH K99/R00 award from NHGRI (HG010910). Author Contributions H.Y.C. and J.Z. conceived the idea for this project and supervised its execution. K.E.W. gathered, preprocessed, and analyzed data for this project with input from all authors. F.M.F. and K.R.P. contributed analysis of mitochondrial APEX-seq and FISH data with input from all authors. All authors contributed to interpreting localization results in the context of coronavirus biology. K.E.W. wrote the manuscript with input from all authors. Declaration of Interests K.R.P. is a consultant for Maze Therapeutics. H.Y.C. is affiliated with Accent Therapeutics, Boundless Bio, 10x Genomics, Arsenal Bio, and Spring Discovery. J.Z. is affiliated with InterVenn Biosciences. Figure 1: Depictions of the SARS-CoV-2 genome (A), the eight compartments that RNA-GPS predicts viral transcript residency to (B), and the predicted residencies for SARS-CoV-2 sgRNAs (C, D) and its 5'/CDS/3' sequence segments (E). The SARS-CoV-2 genome produces a series of sub-genomic RNAs (sgRNAs), each encoding one or more genes/proteins (A). These sgRNAs share a common leader 5' sequence and a common trailing 3' UTR sequence (arrow blocks). For each sgRNA, RNA-GPS predicts residency to each compartment in (B). 
Italicized text indicates the APEX2 fusion protein used to measure transcripts corresponding to each localization (see Table S1). (C) Heatmap of rank scores, indicating how strongly each sgRNA (rows) is predicted to exhibit subcellular residency at each compartment (columns), compared to endogenous human transcripts measured to localize to that compartment. Colors indicate rank scores; color scale is shared across all heatmaps. Most sgRNAs share similar residency patterns, exhibiting statistically significant enrichment towards the mitochondrial matrix and nucleolus (see Table S3). We also computed these rank scores against a baseline of other coronavirus residency signals (D). SARS-CoV-2 exhibits a stronger mitochondrial matrix residency signal than most other coronaviruses, along with greater overall nuclear residency, particularly at the nucleolus. For context, coronaviruses are generally predicted to have residency at the nucleolus, mitochondrial matrix, and ER membrane (see Figure S2). These predictions are also consistent across different models (see Figure S3) and negative-strand SARS-CoV-2 sgRNA precursors (see Figure S4). (E) Shows the predicted residency rank scores for shared 5' and 3' segments, and an averaged residency rank score for the variable coding segments. Even on their own, the short ~90-250 base pair 5' and 3' segments carry mitochondrial and nucleolar residency signals. Figure 1E ). For additional analysis of the mitochondrial dataset and predictions, see Figure S1.

Resource Availability Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Howard Chang (howchang@stanford.edu). This computational study did not generate or use new reagents. The data supporting the findings of this study are all available within publicly available repositories as listed in the Key Resources Table. 
All code required to query and download viral sequences, as well as to reproduce results and figures can be found within the GitHub repository listed in the Key Resources Table. All software dependencies for RNA-GPS and the SARS-CoV-2 analysis described herein are freely available as well. Within the GitHub repository, most code pertaining to SARS-CoV-2 analysis can be found under the "covid19" folder; other folders contain supporting data and source code. Obtaining viral genomes SARS-CoV-2 viral genomes were programmatically queried from the NCBI GenBank online database using the BioPython library's Entrez module (Cock et al., 2009) . The exact query sequence used can be found within the "covid19/covid19.py" file in the GitHub repository. Returned results were then filtered to retain only assemblies that included annotated, named sgRNA "genes." We consider the sgRNAs corresponding to ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10, as these have the most consistent annotations. In cases where the shared 5' leader sequence or the 3' tail were not explicitly annotated, their regions were inferred to be the 5' and 3' trailing bases outside of any coding regions, respectively. As there are many SARS-CoV-2 genome assemblies that fit these criteria, subcellular residency predictions are averaged across all genomes. Viral genomes constituting the coronavirus baseline follow an identical process, save for using a different NCBI GenBank query sequence that specifically fetches matches to the six coronaviruses known to infect humans (excluding SARS-CoV-2): 229E, NL63, OC43, HKU1, MERS-CoV (beta coronavirus that causes Middle East Respiratory Syndrome, or MERS), and SARS-CoV (the beta coronavirus that causes severe acute respiratory syndrome, or SARS) (Su et al., 2016) . The exact query sequence used can be found in the "covid19/baseline.py" source file in the GitHub repository. 
A detailed breakdown of the exact number of genomes we use from each strain is in Table S2. The human cytomegalovirus was chosen for additional evaluation based on a systematic literature review of viral RNA localization studies. This is the only example we found that associates a specific viral transcript with a consistent experimentally validated localization. The viral sequence for validation of our model predictions was obtained from the NCBI GenBank reference sequence NC_006273.2. Due to lack of standardized 5' and 3' UTR region annotations for this transcript (despite these being referenced in the literature), we manually determined these regions after reviewing literature and the overall genome annotation.

RNA-GPS uses k-mer featurization with k = 3, 4, 5, applied independently to the 5' untranslated region (UTR), coding sequence (CDS), and 3' UTR parts of the transcript (Wu et al., 2020b). This creates a feature space of $(4^3 + 4^4 + 4^5) \times 3 = 1344 \times 3 = 4032$ dimensions. These features are then consumed by a random forest model (implemented using the scikit-learn Python library) to generate localization/residency predictions. Extending this definition to the coronavirus sgRNA sequences, we consider the shared 5' leader sequence the fixed 5' UTR input to our model, shared 3' UTR sequence the fixed 3' UTR input to our model, and the variable sgRNA sequence the "CDS" input. For sake of consistency with sgRNA transcript mechanisms, this "CDS" sequence includes the current reading frame, along with any 3' downstream bases until the shared 3' UTR region begins. Each sgRNA is individually assigned predicted residencies. RNA-GPS's per-segment featurization also enables the per-segment residency analysis. For this, we selectively provide the model with only features that correspond to a single segment (i.e. the 5' UTR, CDS, or 3' UTR), with zero values for other features. 
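The per-segment k-mer featurization described above can be illustrated with a minimal sketch. This is not the RNA-GPS source code: the function names, the segment labels (`utr5`, `cds`, `utr3`), the use of raw counts rather than normalized frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration. The sketch only shows how three independent blocks of 1344 k-mer features yield 4032 dimensions, and how a single-segment analysis can zero out the other blocks.

```python
from itertools import product

BASES = "ACGT"
KMER_SIZES = (3, 4, 5)
SEGMENTS = ("utr5", "cds", "utr3")  # hypothetical segment labels

# All 4^3 + 4^4 + 4^5 = 1344 possible k-mers, in a fixed order.
VOCAB = ["".join(p) for k in KMER_SIZES for p in product(BASES, repeat=k)]
INDEX = {kmer: i for i, kmer in enumerate(VOCAB)}


def kmer_counts(seq):
    """Count every 3-, 4-, and 5-mer in one segment.

    Raw counts are an assumption; the published model may normalize them.
    """
    seq = seq.upper().replace("U", "T")
    counts = [0.0] * len(VOCAB)
    for k in KMER_SIZES:
        for i in range(len(seq) - k + 1):
            idx = INDEX.get(seq[i:i + k])
            if idx is not None:  # windows with N or other symbols are skipped
                counts[idx] += 1.0
    return counts


def featurize(utr5, cds, utr3, keep=SEGMENTS):
    """Concatenate per-segment k-mer blocks; zero out segments not in `keep`."""
    segments = dict(zip(SEGMENTS, (utr5, cds, utr3)))
    features = []
    for name in SEGMENTS:
        block = kmer_counts(segments[name])
        if name not in keep:
            block = [0.0] * len(block)  # per-segment analysis: mask this block
        features.extend(block)
    return features
```

Calling `featurize(leader, body, tail, keep=("utr5",))` with any three sequences would produce a vector in which only the first 1344 entries can be non-zero, mirroring the single-segment analysis described in the text.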
For the deep recurrent model, we implemented and trained a recurrent neural network that consumes raw bases as input, maps these to a 32-dimensional embedding layer, passes these through two 64-dimensional gated recurrent units (GRU), and finally a fully-connected layer with sigmoid activation producing 8 localization/residency predictions. This flavor of GRU network is popular in sequence modelling and uses "gating" mechanisms to improve learning of longer-range sequence dependencies (Chung et al., 2014). The model was implemented in PyTorch and was trained to minimize a binary cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 1, with early stopping based on validation set area under the receiver operating characteristic (AUROC). Both RNA-GPS and the GRU model are trained and tuned on the same APEX-seq data, measuring localization within HEK293T cells (Fazal et al., 2019). Localization within this dataset is expressed as an enrichment score compared to the rest of the cell. We consider transcripts that exhibit significant enrichment (log fold change (logFC) > 0 and adjusted p-value ≤ 0.05) for at least one of the eight measured compartments (n = 3660). Many transcripts contain more than one significant localization. Furthermore, due to the nature of the APEX-seq technology, transcripts measured at a specific compartment may also contain transcripts that were picked up as the APEX2 labelling protein itself was being transported to that compartment. This effect is usually minimal, except for mitochondrial transcripts (see Figure S1). We use data splits of 80% train (n = 2928), 10% validation (n = 366), and 10% test (n = 366). As is conventional, the validation set was used for hyperparameter tuning and model architecture tuning. 
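As a rough illustration of the dataset construction above, the following sketch applies the stated significance criterion (logFC > 0 and adjusted p-value ≤ 0.05 in at least one compartment) and then partitions the retained transcripts 80/10/10. The input layout, the function names, and the seeded random shuffle are assumptions for illustration; the source does not state how transcripts were assigned to splits.

```python
import random


def is_significant(logfc, padj):
    """Enrichment criterion stated in the text."""
    return logfc > 0 and padj <= 0.05


def localized_transcripts(measurements):
    """Keep transcripts significant in at least one compartment.

    `measurements` is a hypothetical layout:
    {transcript_id: {compartment: (logFC, adjusted_p_value)}}
    """
    return [
        tid
        for tid, by_compartment in measurements.items()
        if any(is_significant(lfc, p) for lfc, p in by_compartment.values())
    ]


def split_transcripts(ids, percents=(80, 10, 10), seed=0):
    """Shuffle and partition into train / validation / held-out splits.

    The seeded shuffle is an assumption, not the authors' procedure.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * percents[0] // 100
    n_valid = len(ids) * percents[1] // 100
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    held_out = ids[n_train + n_valid:]
    return train, valid, held_out
```

With 3660 retained transcripts, this arithmetic gives partitions of 2928, 366, and 366, matching the sizes reported in the text.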
When removing potentially spurious mitochondrial examples, we start with the above dataset and remove all transcripts that were measured to localize to the mitochondrial matrix but have log fold change enrichment in the bottom 20th percentile of localized mitochondrial matrix transcripts. This removes the 20% of mitochondrial sequences with the lowest enrichment relative to the rest of the cell (n = 61); this denoised dataset contains 240 mitochondrial matrix transcripts instead of 301, and a total of 3599 transcripts compared to 3660 previously.
We use a database of 102 RNA binding protein binding motifs (Ray et al., 2013). To identify matches, we use the same methodology as was used in the RNA-GPS manuscript (Wu et al., 2020b). We start with the position weight matrix (PWM) that describes the motif, adjust its probabilities to account for the background nucleotide composition of each transcript sequence, define a cutoff score slightly lower than the maximum achievable log-likelihood for that PWM, and identify any subsequences that exceed that cutoff. When ablating these PWMs, we use the same methodology for identifying hits, and subsequently replace all hits with "N" bases, re-featurizing the ablated sequence as necessary before feeding it into the model, thus generating the ablated localization predictions.
Baseline construction and rank score
Baseline distributions are constructed by running a set of baseline transcript sequences through a model predicting transcript localization/residency. For each individual model, there is a per-localization baseline derived from human APEX-seq measurements, and one derived from human coronaviruses excluding SARS-CoV-2. For each localization neighborhood within the human baseline, we consider only transcripts that exhibit significant localization to that neighborhood, as defined by having a logFC > 0 and adjusted p-value ≤ 0.05 when running differential expression analysis against the remainder of the cell.
Additionally, we only use transcripts not used for model training/tuning (i.e. the test data split), as this most closely approximates what the model would predict when presented with novel sequences. For the coronavirus baseline, we do not have systematically measured localization data, so we cannot constrain this baseline using known localization behaviors. Instead, each SARS-CoV-2 sgRNA is compared only to homologous sgRNAs from other coronaviruses. For example, the spike protein's residency prediction is only compared against residency predictions of other coronavirus spike proteins. This limits our comparison to the set of genes with easily traceable homology across human coronaviruses, namely ORF1ab, spike (S), envelope (E), membrane (M), and nucleocapsid (N) (Woo et al., 2010). For both these baselines, we define a rank score as the proportion of baseline values that a SARS-CoV-2 sgRNA residency prediction exceeds. A hypothetical value of 0.5 would correspond to the median, 0.25 would correspond to the first quartile, etc.; the rank score is thus bounded between 0 and 1 (inclusive). Note that this rank is calculated for each individual compartment separately, as the baselines themselves are compartment specific. As previously discussed, subcellular residency predictions are averaged across all valid SARS-CoV-2 genomes prior to calculating rank scores. Furthermore, since the human baseline is constrained by measured localizations, whereas the coronavirus baseline is constrained by sequence homology, rank scores for these two baselines are not directly comparable. In addition to computing the rank scores described above, we also evaluate whether these rank scores correspond to significant enrichment. To do this, we compare the underlying predicted residency probabilities (not the rank scores) against a "null" distribution of localization probabilities for human transcripts exhibiting no significant localization.
We do this using a one-sided Wilcoxon rank-sum test (scipy Python package (Virtanen et al., 2020)), with the hypothesis that residency probabilities exceed those of the null distribution. Our data satisfy the Wilcoxon rank-sum test's assumption of independence, and our residency/localization prediction probabilities are naturally ordinal. To address the fact that we do multiple comparisons, we use the Holm method (statsmodels Python package (Seabold and Perktold, 2010)) to correct the resultant p-values.
The sequential FISH experiments were described in Fazal et al. (2019), and resulted in data for 29 transcripts retained for further analysis. The analysis was also described in Fazal et al. (2019), and briefly consists of compiling imaging data from 20 fields of view, each with > 20 cells, with the data processed using MATLAB. For the quantification of each field of view, a mask was generated for each gene of interest using a uniform threshold cutoff of 0.5-0.998, after removing non-cell pixels. The colocalization score with mitochondria was calculated by intersecting the mask of a particular gene with the mask of the mitochondrial-resident transcript MT-ND3, and then dividing the summed intensity of the intersected mask by the summed intensity of the mask of the gene of interest. The quantification results for all 20 fields of view were then averaged to obtain the final number.
To perform gene ontology enrichment analysis, we used the PANTHER tool (Mi et al., 2018) provided by the Gene Ontology Consortium (Ashburner et al., 2000, The Gene Ontology, 2019). Genes were compared in an overrepresentation test against a reference list of all genes in the Homo sapiens database using Fisher's Exact test, with false discovery rate correction. The annotation used was "Reactome version 65." Plots were generated using a combination of the seaborn and matplotlib Python packages (Hunter, 2007).
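Two of the quantities defined in this section, the rank score and the Holm-corrected p-values, are simple enough to sketch directly. The analysis itself relied on the statsmodels package for the Holm correction; the pure-Python version below is only a stand-in to make the arithmetic explicit. The function names are hypothetical, and treating "exceeds" as a strict inequality is an assumption.

```python
def rank_score(prediction, baseline):
    """Proportion of baseline values that `prediction` strictly exceeds (0 to 1)."""
    if not baseline:
        raise ValueError("baseline must not be empty")
    return sum(1 for b in baseline if prediction > b) / len(baseline)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted
```

A prediction of 0.6 against a baseline of `[0.1, 0.2, 0.7, 0.9]` has a rank score of 0.5, i.e. it sits at the baseline median. Rank scores are computed per compartment, so each compartment needs its own baseline list.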
Highlights
• Application of machine learning model of RNA subcellular localization to SARS-CoV-2
• Viral RNAs show residency signal for host mitochondria and nucleolus
• Mitochondria prediction suggests viruses repurpose endogenous pathways
• Predictions may be linked to vesicle formation and viral-host protein interactions
eTOC Blurb
Where the SARS-CoV-2 genome localizes inside human cells remains understudied but may regulate viral replication and host response. We use a machine learning model to predict subcellular residency of the SARS-CoV-2 genome and its encoded transcripts, as well as for other coronaviruses. Our predictions suggest new hypotheses for SARS-CoV-2 mechanisms.
Table S1 (related to Figure 1): APEX2 fusions used to measure localization of transcripts. APEX2 is responsible for labelling, while the protein (segments) it is fused to drive its localization. Additional information regarding the APEX-seq protocol and data can be found in the original APEX-seq manuscript (Fazal et al., 2019), particularly Figure S2. Transcripts picked up by APEX2 (both en route and upon arrival at each fusion's final destination) are used to train the RNA-GPS model.
[...] and is used to localize APEX to the mitochondria as shown in this illustration. Many transcripts thus picked up by COX4 that nominally localize at the mitochondrial matrix are actually nuclear-encoded. We hypothesize that these are picked up as the APEX2-COX4 fusion is transported from cytosol to mitochondria (final arrow). (B) Sequential FISH data showing fraction of transcripts colocalizing at the mitochondria (using the mitochondrial-resident MTND5 RNA as a mitochondrial marker, as described in (Fazal et al., 2019)). Nuclear transcripts like XIST and NEAT1 do not show mitochondrial enrichment, while transcripts known to localize to the outer surface of the mitochondria like SCD and IARS2 are enriched, providing negative and positive controls, respectively. Within this range, "non-canonical" nuclear- [...] Shows a plot of APEX-seq log fold-change enrichment scores at each compartment for the 251 mitochondrial-enriched, nuclear-encoded "non-canonical" transcripts used to build RNA-GPS.
We see that these transcripts have enrichment centered around 0 for all but the mitochondrial matrix, indicating that while these transcripts are nuclear-encoded, the APEX-seq labelling technology consistently and uniquely associates them with the mitochondrial matrix, and they are thus not noise. These transcripts are also biologically meaningful, as shown by a Reactome ontology analysis of the 100 most enriched (by p-value) nuclear-encoded mitochondrial matrix transcripts (D). There is a clear emphasis on cytoskeletal and intracellular transport terms (e.g. kinesins, post-chaperonin tubulin folding pathway, recruitment of NuMA to mitotic centrosomes; adjusted p < 0.05). This supports the interpretation that many of these non-canonical transcripts are picked up as the APEX-seq protein is itself trafficked to the mitochondria.
Figure S2 (related to Figure 1): Summary of residency patterns aggregated across all transcripts comprising the human coronavirus baseline. We see that coronaviruses in general primarily exhibit residency towards the nucleolus, mitochondrial matrix, and ER membrane, a pattern similar to that seen in SARS-CoV-2's sgRNAs (albeit less dramatic).
Figure 1C shows that the positive-strand sgRNA transcripts tend to exhibit residency towards the mitochondrial matrix and nucleolus. Here, we look at the negative-strand precursors to those sgRNAs and observe that these transcripts share similar mitochondrial matrix and nucleolus residency patterns. This suggests another layer of conservation of this predicted residency signal.
Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium
Subcellular Localization of HIV-1 gag-pol mRNAs Regulates Sites of Virion Assembly
Virus-induced double-membrane vesicles
Chapter One - Molecular Insights into Intracellular RNA Localization
In the right place at the right time: visualizing and understanding mRNA localization
RNA localization: Making its way to the center stage
Empirical evaluation of gated recurrent neural networks on sequence modeling
Mutation in murine coronavirus replication protein nsp4 alters assembly of double membrane vesicles
Biopython: freely available Python tools for computational molecular biology and bioinformatics
Database resources of the National Center for Biotechnology Information
Atlas of Subcellular RNA Localization Revealed by APEX-Seq
Visualizing coronavirus RNA synthesis in time by using click chemistry
Matplotlib: A 2D Graphics Environment
The Architecture of SARS-CoV-2 Transcriptome
Adam: A Method for Stochastic Optimization
SARS-Coronavirus Replication Is Supported by a Reticulovesicular Network of Modified Endoplasmic Reticulum
Tom70 mediates activation of interferon regulatory factor 3 on mitochondria
PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools
A compendium of RNA-binding motifs for decoding gene regulation
Mitochondrial Protein Synthesis Adapts to Influx of Nuclear-Encoded Protein
RNA localization regulates diverse and dynamic cellular processes
Viruses and the nucleolus: The fatal attraction. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1842
The Novel Coronavirus, 2019-nCoV, is Highly Contagious and More Infectious Than Initially Estimated. medRxiv
Statsmodels: Econometric and Statistical Modeling with Python
Localization of HIV RNA in mitochondria of infected cells: potential role in cytopathogenicity
Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses
The Gene Ontology Resource: 20 years and still GOing strong
SciPy 1.0: fundamental algorithms for scientific computing in Python
Viral product trafficking to mitochondria, mechanisms and roles in pathogenesis
Coronavirus genomics and bioinformatics analysis
A new coronavirus associated with human respiratory disease in China
Subgenomic messenger RNA amplification in coronaviruses
RNA-GPS predicts high-resolution RNA subcellular localization and highlights the role of splicing
We thank the members of the Chang and Zou laboratories for helpful discussions. We thank Shuo Han, Alistair Boettiger and Alice Ting for generating and analyzing FISH images presented herein. H.Y.C. is supported by RM1-HG007735 and R01-HG004361. H.Y.C. is an Investigator of the Howard Hughes Medical Institute. J.Z. is supported by NSF CCF 1763191, NIH R21 MD012867-01, NIH P30AG059307, NIH