key: cord-0329495-jh3edvhu
authors: Agarwal, Vikram; Kelley, David
title: The genetic and biochemical determinants of mRNA degradation rates in mammals
date: 2022-03-19
journal: bioRxiv
DOI: 10.1101/2022.03.18.484474
sha: 043d873471d9966f75cae5a91ad063b1ba9f6883
doc_id: 329495
cord_uid: jh3edvhu

Background Degradation rate is a fundamental aspect of mRNA metabolism, and the factors governing it remain poorly characterized. Understanding the genetic and biochemical determinants of mRNA half-life would enable a more precise identification of variants that perturb gene expression through post-transcriptional gene regulatory mechanisms. Results Here, we establish a compendium of 54 human and 27 mouse transcriptome-wide mRNA decay rate datasets. A meta-analysis of these data identified a prevalence of technical noise and measurement bias, induced partially by the underlying experimental strategy. Correcting for these biases allowed us to derive more precise, consensus measurements of half-life which exhibit enhanced consistency between species. We trained substantially improved statistical models based upon genetic and biochemical features to better predict half-life and characterize the factors molding it. Our state-of-the-art model, Saluki, is a hybrid convolutional and recurrent deep neural network which relies only upon an mRNA sequence annotated with coding frame and splice sites to predict half-life (r=0.77). Saluki predicts the impact of RNA sequences and genetic mutations therein on mRNA stability, in agreement with functional measurements derived from massively parallel reporter assays. Conclusions Our work produces a more robust “ground truth” with regards to transcriptome-wide mRNA half-lives in mammalian cells. Using these consolidated measurements, we trained a model that is over 50% more accurate in predicting half-life from sequence than existing models. Our best model, Saluki, succinctly captures many of the known determinants of mRNA half-life and can be rapidly deployed to predict the functional consequences of arbitrary mutations in the transcriptome.

The steady-state level of RNA is governed by two opposing forces: the rate of transcription and the rate of decay. While much headway has been made in the problem of predicting steady-state mRNA abundances through the lens of DNA-encoded features that influence transcription rates [1] [2] [3] [4] [5] , relatively less is known about the RNA-encoded determinants that govern decay rates.

Models to predict half-life from RNA sequence in mammals (achieving r 2 =0.20 and r 2 =0.39) [6, 7] have severely lagged behind the performance of those in yeast (achieving r 2 =0.59) [8] .

Integrating the modeling of transcription and RNA decay into a unified model to predict steady-state mRNA levels [2] from genetic sequences would elucidate the full spectrum of regulatory functions of non-coding DNA. Consequently, this would advance the goals of providing a mechanistic explanation for evolutionary constraint in conserved sequences [9] [10] [11] , identifying causal eQTLs [12, 13] , diagnosing pathogenic non-coding genetic variants [14] , and designing more stable and effective mRNA therapeutics [15] .

The rate of mRNA decay is experimentally measured as its half-life, or the time elapsed until the initial RNA concentration has decreased by half. Experimental measurement of transcriptome-wide half-life can be achieved by one of two strategies: i) the application of transcriptional inhibitors [e.g., Actinomycin D (ActD) and α-Amanitin] to cells, or ii) a pulse-chase based method of pulsing modified nucleosides (e.g., 4sU, 5EU, and BrU) followed by chasing with unmodified nucleosides to distinguish newly synthesized RNA from pre-existing RNA [16] . Both strategies are followed by the profiling of RNA levels over a time course, and data for each gene are fit to an exponential decay curve to ascertain the gene's half-life. Although it has long been appreciated that different methods induce measurement bias [16] [17] [18] , many modern studies that deploy these methodologies fail to acknowledge the prevalence and impact of these biases on result interpretation.

Measurement biases emerge for a multitude of reasons. Transcriptional inhibitors are convenient, but do not necessarily enter all cells to fully block transcription [16] . Moreover, such drugs can lead to cytotoxicity [17] or impact unintended pathways such as translation, thus preventing the attachment of mRNA to ribosomes and resulting in artificially altered mRNA metabolism [16, 18] . Pulse-chase methods are inaccurate in the scenario in which the half-life is shorter than the chase period, and cellular parameters such as doubling time and drug uptake further bias half-life measurement [18] . Finally, the incorporation of uridine analogs is thought to be a stochastic process that is proportional to the number of Us in an RNA, leading to RNA-length-dependent labeling and enrichment biases [19, 20] . Collectively, the ultimate consequence of these biases is the disagreement between different methods in the physiologically relevant estimate of half-life [16] .

It has been postulated that considering an ensemble of different methods would empower a more precise measurement of half-life, providing a path towards circumventing the tradeoffs and limitations among any individual method [18] . While a meta-analysis of half-life datasets in yeast revealed surprisingly discordant results among half-lives measured by different research groups [21] , the consistency among half-life datasets in mammalian organisms remains largely uncharacterized. Collecting such a compendium of datasets would potentially enable the derivation of consensus measurements of cellular mRNA half-life in a fashion that is less obfuscated by technical noise and methodological bias.

A precise measurement of RNA half-life would enable a clear-eyed examination of how different molecular pathways modulate half-life relative to one another. Numerous RNA-binding proteins (RBPs) and sequence-encoded features have been implicated in regulating half-life. Examples include: i) generic features of an mRNA such as its GC content [22] , length, and ORF exon junction density [6, 7] ; ii) the presence of microRNA (miRNA) binding sites [23] [24] [25] ; iii) codon frequencies and interactions with the translation machinery [8, [26] [27] [28] [29] [30] ; iv) mRNA structure [31, 32] ; v) Pumilio binding elements [33] ; vi) AU-rich elements (AREs) [16, 34] ; and vii) YTHDF proteins [35] via m6A recognition [36, 37] .

Attempts to examine the relative contribution of sequence and biochemical features to the specification of half-life [6] [7] [8] [38] [39] [40] have been undermined by half-life measurement biases, and have not exhaustively considered all of the known pathways that affect RNA stability. In this study, we assembled a compendium of 54 human and 27 mouse mRNA half-life datasets to derive more precise, consensus measurements of RNA stability in mammalian cells. Using our enhanced measurements, we derived improved genetic and biochemical models towards the goals of quantifying the relative influence of different pathways and improving the predictability of half-life from such features. Our state-of-the-art model Saluki, a hybrid convolutional and recurrent neural network, is capable of predicting the effects of RNA sequences and genetic variants therein on RNA stability, in agreement with functional measurements derived from massively parallel reporter assays.

To generate a compendium of mammalian half-life datasets, we mined the literature for published human and mouse half-life data. In total, we identified 33 publications reporting transcriptome-wide half-lives, with 16 submitting human data, 14 submitting mouse data, and 3 submitting data from both species ( Table 1) . Counting individual replicates, this led to a total of 54 human and 27 mouse half-life measurements. The human studies encompassed 10 cell types and 5 measurement procedures, while the mouse studies encompassed 8 cell types and 3 procedures ( Table 1) . Each of the human studies reported half-lives for between~2,500-13,000 genes (Additional file 1: Fig. S1 ), while those from the mouse reported half-lives for betweeñ 500-18,000 genes (Additional file 1: Fig. S2a ) (Additional file 2: Dataset S1).

In order to evaluate the consistency among studies, we computed the pairwise Spearman correlations of all datasets, using the subset of common genes measured by each pair of studies.

Most pairs of human datasets exhibited strong (i.e., ≥0.6) correlations (Fig. 1) . However, the datasets segregated largely into two clusters; the first encompassed a diverse array of studies, cell types, and procedures, and the second encompassed datasets from only five studies [27, [41] [42] [43] [44] which appeared to be outliers in that they exhibited poor (i.e., ≤0.5) correlation to most studies other than to those from their own batch (Fig. 1) . Most subclusters did not cleanly segregate by cell type or experimental method; rather, they tended to cluster closely by their study or [29, 38, 46, 49 ] whose data were deposited as degradation rates rather than half-lives. Samples are clustered using hierarchical clustering according to the indicated dendrogram. Rows are labeled by the study of origin ( Table 1) and colored by the cell type of origin and measurement approach.

laboratory of origin. This indicated the likely presence of batch effects which masked the cell-type specific signal captured by the data. Datasets from the mouse exhibited similar patterns of clustering, except that only two studies [60, 64] appeared to be the greatest outliers due to their poor correlation to most other studies (Additional file 1: Fig. S2b) .

Comparison of methodological bias and cell-type specificity captured in the mammalian half-life compendium Given the wide disparity in reported genes (Additional file 1: Fig. S1 and S2a) and potential existence of outliers in some samples, we pre-processed our gene x sample human and mouse half-life matrices to improve our ability to evaluate sample relatedness and examine possible sources of measurement bias. We standardized the samples in each matrix, used iterative PCA to impute missing gene measurements, and performed quantile-normalization to align the samples into similar distributions. In total, we recovered 13,921 human genes and 14,463 mouse genes in our matrices (Additional file 3: Dataset S3). Finally, we performed PCA to evaluate the relatedness among the 54 human samples. As observed previously ( Fig. 1) , PC1 identified samples from a single study [44] to be the greatest outliers relative to all other studies (Additional file 1: Fig. S3 ). We therefore found it parsimonious to assume that measurements from this study were severely biased, and henceforth excluded samples from this study from further analysis.

Reperforming the PCA analysis on the remaining 39 human samples revealed weak to non-existent clustering by cell type, and stronger clustering by measurement method (Fig. 2a) .

Specifically, PC2 seemed to segregate samples derived from pulse labeling experiments (i.e., Figure 2 . Assessment of measurement bias and cell-type specificity present in half-life data. a) PCA of all human samples except those from an outlier study [44] , with sample names colored according to cell type and corresponding data point colored according to measurement approach. Axes are labeled according to the percentage of variance among samples explained by the first two PCs. See also Additional file 1: Fig. S3 for the same analysis using all samples. b) Boxplot of sample distributions along PC2, partitioned according to the measurement method (i.e., pulse labeling or transcriptional shutoff). Replicates for the same study were first averaged according to their PC2 value prior to assessing differences between the methods, with statistical differences between distributions evaluated using a two-sided Wilcoxon rank-sum test. c) Evaluation of the Pearson correlations between pairs of half-life samples. Considered in this plot were the subset of pairs of two different studies that interrogated half-lives in either the same cell type or different cell types. Statistical differences between the distributions were evaluated using a one-sided Wilcoxon rank-sum test to assess whether correlations from the same cell type exceeded those from a different cell type. d-e) These panels are the same as those in (a) and (c), respectively, except compare mouse samples. f) Comparison of consensus, cell-type agnostic (i.e., methodology and cell-type independent) measurements of human and mouse half-lives among one-to-one orthologous genes. Half-lives for each species were computed as PC1 of the respective gene x sample matrix. Also indicated are the Pearson (r) and Spearman (rho) correlation values as well as sample size (n) of genes considered. Shown in all boxplots is the median value (bar), 25th and 75th percentiles (box), and 1.5 times the interquartile range (whiskers). those using 4sU, BrU, and BrU4sU) with those derived from transcriptional shutoff experiments (i.e., those using ActD and α-Amanitin). After averaging the PC2 values among replicates, we observed a statistically significant difference between the PC2 distributions for these two methodological classes (Fig. 2b) , revealing a fundamental inconsistency in these techniques.

To ascertain whether the human data captured cell-type-specific half-lives, we evaluated all pairwise Pearson correlations between half-lives from samples derived from different studies (in order to minimize the influence of inflated correlations among replicates of the same study), but assessed on either the same cell type or different cell types. We hypothesized that we should see stronger correlations among samples evaluating the same cell type; however, we did not observe statistical support for this hypothesis (Fig. 2c) , indicating that there was no detectable cell-type-specific signal for half-lives in human cells. Nearly identical results were achieved using Spearman correlations instead of Pearson correlations (data not shown). Given the known tissue-specific roles of miRNAs [67] and other post-transcriptional RBP regulators, we find it more plausible that technical noise and methodological bias obscure the cell-type specificity of half-life measurements relative to the possibility that none exists biologically.

Next, we again used PCA to evaluate the relatedness among the 27 mouse samples. In the mouse, there appeared to be stronger clustering by cell type and measurement methodology along both PC1 and PC2 (Fig. 2d) . Indeed, Pearson correlations between pairs of half-life measurements from the same cell type and different study exceeded those from different cell types and different studies (Fig. 2e) , suggesting the detection of a cell-type specific signature.

Although this observation is encouraging, the limited sample size, coupled with the confounded nature of measurement methodology and cell type, make it ultimately impossible to decompose the relative influence of methodological bias and cell-type specificity in the measurement of mouse half-lives. In the future, a controlled study measuring half-lives in a number of different cell types with multiple experimental approaches would help to more accurately assess this. Given our lack of success in identifying clear cell-type-specific measurements of half-life in the human and mouse samples, we instead focused on deriving cell-type-agnostic (i.e., universal) measures of half-life which integrate information across all methodologies and cell types.

Towards this end, we computed the first PC in each of our gene x sample matrices to derive consensus measurements of half-life for each gene from each species. The PC1s explained 62.4% and 62.6% of variance among genes in each species, respectively, while PC2s explained merely 6.6% and 9.8% of variance, so were henceforth not used. Given that this procedure was performed independently in each species, we would expect that any artifacts induced by the procedure would corrupt the evolutionary relationship of measurements between species.

However, we instead observed a strong concordance between our consensus half-life measurements among one-to-one orthologs between the two species (Fig. 2f) . Our interspecies Pearson correlation of 0.78 greatly exceeds that of 0.61 reported previously in the literature [6] .

This indicates that mRNA half-life has been more strongly conserved than previously appreciated between mammalian species separated by~75 million years of evolutionary time.

Moreover, this finding also demonstrates our ability to recover a highly precise measurement of half-life that successfully ameliorates the impact of technical noise and methodological bias.

A genetic model of mRNA half-life Given our consensus measurements of mRNA half-life, we sought to determine whether it was possible to improve the predictability of half-life from mRNA sequence. Towards this end, we engineered groups of features ( Table 2 ) with the goal of deriving a machine learning (ML)-based regression model to automatically select the subset of pertinent features. Our sequence-derived feature groups consisted of: i) basic mRNA features such as the length and G/C content of different functional regions and ORF exon junction density [2, 6, 7] ; ii) k-mer frequencies of length 1-7 in the 5′ UTR, ORF, or 3′ UTR; iii) codon frequencies; iv) predicted repression scores of mammalian-conserved miRNA families [24] ; and v) predicted binding of various RBPs to the mRNA sequence by SeqWeaver [68] and DeepRiPE [69] . RBP binding was predicted separately for the 5′ UTR, ORF, and 3′ UTR because it has previously been observed from cross-linking and immunoprecipitation sequencing (CLIP-seq) that RBPs have a tendency to bind in specific functional regions [70] .

Next, we sought to determine which of these sequence-derived groups of features would be useful for predicting half-life. Because of the hierarchical association of each feature to a group ( Table 2) , we thus evaluated a series of nested models which iteratively considered additional groups of features. Each feature set was fed into a lasso regression model, and the relative model performance was compared between a simpler and more complex group on different folds of held-out data using a 10-fold cross-validation (CV) strategy. This helped to establish whether the inclusion of additional groups was justified. Evaluating the model series on our human half-life data, we observed that: i) 5′-UTR k-mers did not improve the model, ii) ORF k-mers did not improve the model beyond the simpler consideration of codons, iii) both 3′-UTR k-mers and Table 2 . Summary of features considered in models trained to predict mRNA half-life, with a description of the features considered, feature type (i.e., sequence or biochemical), data source, and number of features in the category. Sequence features were calculated identically for both human and mouse. Biochemical features were calculated only with respect to human data.

predicted miRNA repression scores improved the model, and iv) SeqWeaver predictions of RBP binding improved the model more than those from DeepRiPE (Fig. 3a ). Our final model that optimally balanced the tradeoff between complexity and performance was the "BC3MS" model, which considered basic mRNA features, codon frequencies, 3′-UTR k-mers, miRNA repression scores, and SeqWeaver predictions. Concatenating the predictions across all folds of held-out data, we observed a correlation of 0.67 between the BC3MS model predictions and observed human half-lives (Fig. 3b) . We retrained this model on the full dataset to assess which features contributed most significantly to half-life prediction. The top-30 ranked features of the model consisted primarily of basic mRNA properties, codon frequencies, and predicted miRNA/RBP binding sites (Fig. 3c) . Consistent with previous work [2, 6, 7] , ORF exon junction density was the dominant feature. Moreover, the signs of coefficients associated with codon frequencies are consistent with previously published human Codon Stability Coefficients (CSCs) [26, 27, 29, 30] , which were computed based upon an isolated Pearson or Spearman correlation between the codon frequencies and half-lives [28] .

CSCs have been previously established as a quantitative metric that captures the association between a codon and mRNA stability [26] [27] [28] [29] [30] . Our model contextualized relative codon influence with respect to other sequence features in a multiple linear regression framework, and re-ranked their utility in the prediction task.

Next, despite considering over 20,000 possible 3′-UTR k-mers, the model automatically discovered highly conserved regulatory motifs such as UAUUUAU, the core AU-rich element (ARE) [16, 34] ; UGUAAAU and its variant UGUAUA, Pumilio binding elements [33] ; and AAUAAA, the cleavage and polyadenylation motif involved in alternative polyadenylation [81, 82] . The binding of four RBPs emerged as the most useful features, including that of SAFB2

and RBM22 in the ORF as well as HNRNPUL1 and SRSF3 in the 3′-UTR. Finally, consistent with previous work showing that miRNAs weakly impact half-life [7] , the model predicted light repressive roles for two miRNAs, miR-19a and miR-25.

Although we find these factors to be promising candidates, we caution that the interpretation of feature selection and coefficient-based ranking is inherently limited by the substantial degree of multicollinearity among features. To guard against the possibility of lasso regression selecting one feature over another through its spuriously higher correlation to half-life, we examined the full set of features strongly correlated to the top 30 selected features (Fig. 3d) . While the majority of features were not strongly correlated to other features, we indeed found that Fig. S4d ).

Finally, we evaluated the ability of the models trained in each species to generalize to the opposite species. Our interspecies comparisons revealed that models tested in the opposite species performed competitively, albeit slightly worse, than models trained within the same species, for both the human and mouse (Additional file 1: Fig. S4e ). Table 2) . Implementing the same strategy as before to compare nested models with 10-fold CV, we observed that: i) PAR-CLIP data did not improve the model, ii)

RIP-seq data did not improve the model after jointly considering eCLIP and m6A CLIP data, and iii) the remaining datasets all improved the model, with the greatest benefit emerging from considering the ENCORI database (Fig. 4a) . "BEeM" was our best model, which considered basic mRNA features, ENCORI CLIP, eCLIP, and m6A CLIP data. We observed a global correlation of 0.77 between the BEeM model's held-out predictions and observed human half-lives (Fig. 4b) .

The top-30 ranked features of the model (i.e., retrained on the full dataset) consisted primarily of the ORF exon junction density and dozens of RBPs (Fig.   4c) . A number of RBPs consistently emerged between our genetic (Fig. 3c, . 4c) .

Collectively, these factors were strongly correlated to others which serve as candidate regulators of mRNA stability (Fig. 4d) . Fig. S5c,d) .

A deep-learning based genetic model of mRNA half-life

Having compared the performance of genetic and biochemical models, we sought to evaluate whether an alternative learning paradigm might be able to automatically decipher sequence-based rules that are more predictive of mRNA half-life. Towards this goal, we trained a hybrid convolutional and recurrent deep neural network architecture, called Saluki, which has the advantage of potentially uncovering nonlinear (e.g., cooperative) relationships among motifs and spatial principles with respect to motif positioning. The input to our model was a binary encoding of the mRNA sequence (i.e., up to a maximum of 12,288 nt), concatenated with binary variables for each nucleotide indicating the presence or absence of the first reading frame of a codon and splice site (Fig. 5a) . This matrix was fed in to a neural network comprised of 64 1D

convolutions (width 5), a max pooling layer (width 2), 6 additional blocks of these aforementioned layers, a recurrent layer consisting of a gated recurrent unit (GRU), and a densely connected layer (Methods, Fig. 5a , and Additional file 1: Fig. S6a) . Given that mRNA sequences are variable in length, the 3′ end of each mRNA was padded with Ns after the transcriptional end site. To account for this property of the input, we left-justified the input matrix, and the GRU was oriented to pass information from the right to left such that the sequence was still most recently considered by the network (i.e., as opposed to the padded Ns) when integrating information from the full sequence to make a prediction. The parameters comprising this common "trunk" of the network were jointly trained on human and mouse data in alternating batches of species-specific data, with two "heads" to predict human and mouse half-lives, respectively [4, 5] .

We performed an ablation analysis in which the model was evaluated after training it with only sequence as input; sequence and reading frame tracks; and sequence, reading frame, and splice site tracks. This analysis revealed that all forms of information were utilized by the network for Saluki to achieve optimal performance (Additional file 1: Fig. S6b) . Training our Saluki model with the identical folds of data as our lasso regression models enabled us to directly compare their respective performances on each of the 10 folds of held-out data. Averaging the predictions using models derived from each of five independent training runs for each fold, Saluki's performance exceeded that of the genetic lasso regression models for each species (i.e., "BC3MS" for human and "BC3MSD" for mouse), and it performed nearly identically with the human biochemical regression model (i.e., "BEeM") (Fig. 5b) . The final human and mouse models displayed correlations of 0.77 and 0.73, respectively ( Fig. 5c and Additional file 1: Fig.   S6c ), suggesting that Saluki potentially learned novel principles of post-transcriptional gene regulation which our simpler linear models were unable to capture. . The deep learning model, called Saluki, was jointly trained on mouse and human half-life data to predict species-specific half-lives. b) Performance of the trained Saluki models on each of 10 held-out folds of data, relative to the corresponding performances from our best genetic (i.e., "BC3MS" for human and "BC3MSD" for mouse, respectively) and biochemical (i.e., "BEeM") lasso regression models. An improvement relative to another model was evaluated with a two-sided, paired t-test. c) Shown are the final predictions after concatenating the observations for all 10 folds of held-out data. Also indicated are the Pearson (r) and Spearman (rho) correlation values. d) Metagene plot of ISM scores across all mRNAs for percentiles along the 5′ UTR, ORF, and 3′ UTR. mRNAs were grouped into one of 4 bins according to their predicted half-lives. For the set of mRNAs within each bin, we plotted the average of the absolute value of the mean predicted effect size (i.e., of the three possible alternative mutations). e) ISM results of two 3′-UTR segments from TUBGCP3 and PI4K2B. Partial matches to the AU-rich element (ARE, or "UAUUUAU") and Pumilio/FBF (PUF, or "UGUAHAUA") binding element consensus sequences are boxed. For each motif, single point mutations resulting in particularly severe or opposite phenotypes are shown alongside annotations reflecting the corresponding ARE and PUF consensus gain or loss events. f) Insertional analysis of motifs discovered by TF-MoDISco [83] . Each motif was inserted into one of 50 positional bins along the 5′ UTR, ORF, and 3′ UTR of each mRNA. Indicated is the average predicted change in half-life for each bin plotted along a metagene. g) This panel is the same as panel f), except it performs analysis of 61 codons (excluding the 3 stop codons) inserted into the first reading frame along the length of the ORF. Selected codons are colored, with the rest shown in gray. h) Scatter plot showing the relationship between the mean influence of each codon along the length of the ORF, as predicted by Saluki in panel g), and the mean codon stability coefficient over a set of cell types as observed previously [26] . Also indicated are the Pearson (r) and Spearman (rho) correlation values.

In the hopes of revealing such principles, we tested the predictive behavior of Saluki in different contexts. First, we performed an in silico mutagenesis (ISM) for every human mRNA in our dataset, mutating each reference nucleotide of an mRNA into every alternative allele to evaluate the predicted change in half-life [1, 4, 5, 84] . We generated a metagene plot using these scores, evaluating the absolute value of the mean effect size (i.e., of the three possible alternative mutations) in percentiles along the length of each functional region (Fig. 5d) . This analysis revealed that the model predicts a relatively modest influence for the 5′ UTR relative to the ORF and 3′ UTR. Moreover, the termini of functional regions, such as the 3′ terminus of the 5′ UTR as well as the 5′ and 3′ termini of the ORF and 3′ UTR, were revealed to harbor the most informative positions along an mRNA that contribute to half-life (Fig. 5d) .

We further interrogated our ISM scores to identify the most pertinent motifs associated with changes in half-life using TF-MoDISco [83] . The poly-A binding element ("AAAAAAA"), ARE ("UAUUUAU"), PUF element ("UGUAHAUA"), putative ELAVL1/2 element ("UUUAU"), polyadenylation element ("AAUAAA"), and m6A motif ("GGACU") were enriched as significant motifs in the 3′ UTR. Of these, the m6A motif and PUF element were also enriched in the ORF and 5′ UTR, and the putative ELAVL1/2 and polyadenylation elements were enriched in the 5′ UTR (Additional file 1: Fig. S6d) . Visualizing ISM scores for two random examples, the 3′-UTR segments from TUBGCP3 and PI4K2B, illustrates several learned properties involving these motifs (Fig. 5e) . Broadly speaking, a mutation ablating a core ARE or PUF element led to a predicted increase in mRNA half-life. Conversely, several point mutations led to a gain of a motif, leading to a predicted decrease in half-life. While a subset of these mutations generated novel motifs, others changed an existing motif closer to its consensus sequence. In some cases, a mutation caused a dual loss of overlapping PUF/ARE motifs, leading to a more severe predicted change in half-life relative to the loss of each individual motif (Fig. 5e) .

Next, we evaluated how the model would behave if we inserted either a splice site or one of our five enriched motifs along the full length of each mRNA. We performed this insertional analysis for most human mRNAs in our dataset, and then averaged the result according to the spatial bin of the insertion along each functional region. Consistent with their known roles, we observed that insertion of an m6A site, ARE, or PUF element reduced mRNA stability, with the greatest effect size arising from a 3′-UTR insertion (Fig. 5f) . In contrast, insertion of a splice site, polyadenylation element, or C-rich element led to enhanced mRNA stability. Consistent with the lasso regression model, the presence of a splice site led to at least a four-fold enhancement in half-life relative to alternative motifs. The most novel property captured in the deep learning model-yet absent from our lasso model-is the strong dependence between the spatial coordinate of the motif along the mRNA and its predicted impact on mRNA half-life, both across and within functional regions (Fig. 5f) . For instance, ARE and PUF sites are predicted to most strongly repress mRNA stability if they occur at the 5′ or 3′ terminus of the 3′ UTR, reminiscent of a well-known property of microRNA-mediated repression [24, 85] . In contrast, the m6A element is predicted to most destabilize mRNA immediately after the stop codon, mirroring the known relationship between m6A deposition and mRNA stability [58] .

Given the mechanistic link between codon usage and mRNA half-life [8, [26] [27] [28] [29] [30] , we sought to ascertain whether Saluki has also learned this property. We therefore reiterated our insertional analysis, this time using 61 codons (excluding the 3 stop codons) inserted into the first reading frame along the length of each ORF. As before, the model attributed substantially different effect sizes to codons depending upon their position along an ORF, with the greatest predicted effects occurring close to the start and stop codons (Fig. 5g) . Although most codons had the greatest predicted effect when inserted closer to the stop codon, several codons such as "CUG", "UUC", "GAA", and "GAC" were predicted to have a modest effect when inserted into most regions except near the start codon (Fig. 5g) . We sought to compare our predicted codon effects to existing measures of codon influence, such as the aforementioned CSCs [26] [27] [28] [29] [30] . We therefore investigated the relationship between the mean codon influence across an ORF, as predicted by Saluki, relative to the mean CSC across numerous cell types as quantified previously [26] .

Reassuringly, there was a strong relationship between the two metrics (Pearson correlation = 0.68, Fig. 5g (Fig. 6a) . We observed a general qualitative agreement between model predictions and experiment in the 3′-UTR regions showing high activity, which typically overlapped conserved and ARE-containing regions (Fig. 6a) . However, there were several novel elements detected by the assay which were conserved but not predicted by the model, indicating several false negatives among the predictions. Next, we evaluated how well our predictions agree with saturation mutagenesis data from the tested CXCL2 3′-UTR region. To more closely simulate this experiment in silico, we integrated each of the measured oligonucleotide fragments into the 3′ UTR of eGFP within the corresponding BTV vector tested in the experiment [86] . Then, for each mutation, we calculated the predicted variant effect as the divergence of predicted half-lives between the reference and alternative sequence. We observed a strong agreement between the observed and predicted variant effects, in which both methods highlighted a strong effect for a set of upstream AREs and a weaker effect for a downstream ARE (Fig. 6b) . Collectively, the variant effect predictions agreed with those observed with a Spearman correlation of 0.69 (Fig. 6c) . Finally, we assessed how well we could predict the observations of an MPRA testing the effect of 3,000 highly conserved 3′-UTR segments on RNA stability. We achieved a Spearman correlation of 0.63 between our model predictions and experiment (Fig. 6d) , suggesting that our model is able to integrate the causal relationship between RNA sequence and RNA stability. 

Scatter plot of the observed and predicted 3′-UTR effects for each of 3,000 conserved 3′ UTRs profiled by fastUTR [86] . e) Scatter plot of the observed and predicted variant effects, as measured in Beas2B cells [87] . Also indicated are the Pearson (r) and Spearman (rho) correlation values for panels (c-e).

To further evaluate the generality of our model, we turned to other MPRAs testing both 3′-UTR and variant effects in Jurkat and Beas2B cells; the elements tested were heavily enriched in ARE-containing regions [87] . We observed Spearman correlations of 0.42 and 0.49 between our model predictions and observed 3′-UTR effects in Jurkat and Beas2B cells, respectively (Additional file 1: Fig. S7a,b) . These predictions were similar to the observed 3′-UTR effects between both cell types (Spearman correlation = 0.52, Additional file 1: Fig. S7c) , suggesting Fig. S7f) .

As a final test for the model, we evaluated its agreement with MPRAs that measured the functional effect of 12,173 3′ UTRs in six cell types [88] . While the observed values agreed well between any pair of cell types (Spearman correlations from 0.60 to 0.83), our predictions agreed a bit more modestly (Spearman correlations from 0.26 to 0.50, Additional file 1: Fig. S7f) ,

suggesting that they partially captured the causal factors linking RNA sequence to stability in this dataset.

Since the emergence of the first transcriptome-wide measurements, there have been numerous efforts to understand the sequence-encoded determinants of mRNA half-life [6] [7] [8] [38] [39] [40] 90, 91] .

Despite the diversity of modeling approaches developed to predict half-life, the quality of the predictive engines has been severely limited by the quality of the measurement itself. It has long been recognized that different methodologies suffer from biased measurement [16, 17] , leading to an inconsistent portrait of transcriptome-wide half-lives. Indeed, nearly three decades ago, it was theorized that considering an ensemble of different methods would ultimately lead to a more precise definition of half-life [18] . In this study, we aimed to establish a more precise "ground truth" of mammalian mRNA half-lives, empowered by the systematic collection and subsequent meta-analysis of a large cohort of mammalian half-life datasets. As anticipated, this exercise revealed inconsistencies among data derived from different research groups and experimental protocols (e.g., pulse-chase vs transcriptional shutoff). We therefore derived consensus measurements of mRNA half-life which were less susceptible to technical noise and methodological bias.

Importantly, these bias-adjusted measurements led to several favorable outcomes. First, there was an improved interspecies agreement between the half-lives of orthologous human and mouse genes, indicating a greater conservation of this molecular trait than previously appreciated. Next, there was a substantial improvement in quantitative efforts to predict half-life from genetic and biochemical features. Finally, the relative importance of the heterogeneous mechanisms influencing half-life could be better delineated. Our lasso regression and deep learning models captured most of the known determinants of half-life, integrating the influence of AREs, PUF elements, YTHDF1-binding elements, splice sites, microRNA binding sites, and codon composition. While our lasso models implicate novel roles for several additional RBPs, our deep learning models propose novel properties of gene regulation such as the position-dependent effects of codons, motifs, and splice sites.

Aside from its improved accuracy, one of the key advantages of our deep learning model relative to our lasso regression models is its ability to rapidly generate predictions given arbitrary sequences, obviating the need to compute hundreds to thousands of features for each mRNA. 

In this study, we demonstrated our ability to predict mRNA half-life from sequence with an r 2 of 0.59, which represents nearly a 50% performance improvement relative to existing models in mammals (r 2 =0.20 and r 2 =0.39) [6, 7] . Among the biggest reasons for this improvement is simply the derivation of a more precise measurement of mRNA half-life from a meta-analysis of transcriptome-wide half-life datasets. Given our encouraging results in the auxiliary tasks of sequence and variant effect prediction, as quantified by the consistency between model predictions and MPRA data, we foresee several practical applications for Saluki.

First, there is a critical need to bioengineer stable RNAs for vaccine development and gene therapy applications. Despite the recent success of mRNA vaccines targeting Covid-19, they are inherently unstable and prone to degradation due to the prevalence of intracellular and extracellular RNases [92] . Problems surrounding mRNA instability and inefficient protein expression have been partially overcome through the incorporation of stabilizing mRNA structures [15] , a 5'-cap, modified nucleosides, and codon optimality rules [92] . In the context of adeno-associated virus (AAV)-based gene therapy vectors, codon optimality rules have also been engineered to encourage greater mRNA stabilities and higher translation rates for several therapeutic proteins [93] . Nevertheless, the global sequence has not yet been jointly optimized for the mRNA to exhibit a longer half-life. Deep learning models have been successfully coupled to generative models, such as generative adversarial networks, to engineer nucleic acids to obey certain properties. This is exemplified by the design of 5′ UTRs to improve translation rate [94] , 3′ UTRs to enhance cleavage and polyadenylation [95, 96] , and coding sequences to evolve antimicrobial peptides [97] . Similarly, we envision Saluki's use as an external function analyzer, or "oracle", to engineer mRNAs with desired half-life properties.

Given the widespread roles of mRNA decay in the genetics of human health and disease [12, 88, 98] , Saluki is also poised to provide insight into whether a genetic variant functions through post-transcriptional gene regulatory mechanisms. Towards this goal, it could help to fine-map causal eQTLs through the integration of its score into supervised learning frameworks [13] . Such scores could also be integrated into tools intended to identify pathogenic non-coding variants in the human genome [14] to further unravel the genetic basis of disease.

We manually collected the processed half-life values from all of the studies indicated ( Table 1 ).

In cases in which genes were provided as gene names or RefSeq IDs, the names were converted to their corresponding Ensembl IDs using information from the Ensembl BioMart. Half-life measurements were log-transformed after adding a pseudocount of 0.1 or 1, depending upon whether the unit of the provided half-life was measured in hours or minutes, respectively.

Duplicated IDs were then averaged to compute a single half-life measurement for each gene.

Five human datasets from four studies provided data as degradation rates rather than half-lives [29, 38, 46, 49] , so their transformed half-life values were negated. Finally, we generated a sparse matrix of half-lives for all genes x samples, containing missing values for genes in which the sample did not provide a measured half-life (Additional file 2: Dataset S1).

To evaluate the relatedness between samples, we first extracted the subset of genes in the aforementioned matrix for which ≥10 human samples (or ≥5 mouse samples) reported non-missing values. We then z-score transformed the half-lives from each sample to standardize the scale of each sample. To impute the missing values of the matrix, we used the estim_ncpPCA function in the missMDA R package to estimate the appropriate number of principal components (PC) for the imputePCA function, considering between 0 to 20 components for the human samples (or 0 to 10 components for the mouse samples). Finally, the data was quantile normalized using the normalize.quantiles function in the preprocessCore R package. The first PC of this imputed matrix (as computed by the prcomp function in R) was used as a robust cell-type-independent measurement of mRNA half-life, and computed for human and mouse species separately. For human, the samples from a single study [44] were removed prior to computing the PC because the samples from this study represented large outliers relative to all other samples (Additional file 3: Dataset S2).

For evolutionary comparisons between species, one-to-one human-to-mouse orthologs were acquired from the Ensembl v90 BioMart [99] by extracting the "Mouse gene stable ID" and "Mouse homology type" with respect to each human gene.

To examine sample relatedness, the first two PCs were computed using the transpose of the imputed (gene x sample) matrix, and samples were annotated and colored by their cell type of origin, study of origin, and half-life measurement technique.

Gene annotations for protein coding genes were derived from Ensembl v83 (hg38 genome build) and v90 (mm10 genome build) for human and mouse, respectively [99] . Only protein-coding genes were carried forward for analysis. Out of all transcripts corresponding to each gene, the one with the longest ORF, followed by the longest 5′ UTR, followed by the longest 3′ UTR was chosen as the representative transcript for that gene [2, 24, 25] .

Basic mRNA features to predict half-life

The G/C content and lengths of each of these functional regions (i.e., 5′ UTRs, ORFs, and 3′

UTRs), intron length, and ORF exon junction density (computed as the number of exon junctions per kilobase of ORF sequence) were gathered as "basic" mRNA features associated with mRNA half-life [2, 6, 7] . All length-related features were transformed such that: ← log 10 (x + 0.1) tô reduce the right skew [2] .

Sequence features to predict mRNA half-life

All genetically encoded features are summarized in Table 2 . Codon frequencies were extracted from ORF sequences using the oligonucleotideFrequency function (parameters width=3, step=3)

in the Biostrings R package, and normalizing the counts for each codon by the sum of all codon counts for the gene. K-mer frequencies were computed similarly from each of the 5′ UTR, ORF, and 3′ UTR sequences (parameters width={1..7} depending on the size of the k-mer, step=1).

MicroRNA features were collected for mammalian-conserved miRNA families using TargetScanHuman7.2 and TargetScanMouse7.2 [24] . We computed a miRNA target binding score by negating the "cumulative weighted context++ score" for each miRNA family.

The degree of predicted binding to a number of RBPs was computed for 5′ UTR, ORF, and 3′

UTR sequences separately using SeqWeaver [68] and DeepRiPE [69] . For SeqWeaver, we generated predictions on each 50 nt window, padding the sequence with its neighboring 475 nt upstream and downstream sequence (or Ns in the case of sequences at the boundary of a region)

to generate a 1000 nt input sequence. For DeepRiPE, we generated predictions on each 50 nt window, padding the sequence with its neighboring 50 nt upstream and downstream sequence (or Ns in the case of sequences at the boundary of a region) to generate a 150 nt input sequence.

DeepRiPE additionally required information about the functional region of interest, which we provided according to the functional region being considered. After generating predictions for all human and mouse RBPs for each functional region and gene, we computed the average value of the binding by normalizing the sum of values for each predicted RBP to the total length of the functional region.

All biochemical features are summarized in Table 2 . The number of CLIP peaks associated with each gene for each of 133 RBPs was downloaded from the ENCORI database [71] . eCLIP peaks were collected from the ENCORE browser as narrowPeak BED files [72] . The peaks for each RBP were intersected with gene body annotations and counted for each gene using "bedtools intersect" (parameters -s -c) [100] . PAR-CLIP peaks were downloaded from a previous study [70] and processed similarly to count the total peaks overlapping the gene body. m6A pathway (i.e., a6A, m6Am, YTHDF2, METTL3, METTL14, and WTAP) CLIP peaks were collected from an assortment of previous studies [37, 42, [73] [74] [75] [76] [77] , and if not already provided, the number of peaks intersecting gene bodies was computed. For all CLIP assays, genes with missing values (i.e., those without an annotated peak) were considered to have zero peaks. All peak counts were transformed as such: ← log 10 (x + 1) to reduce the right skew of the distributions.

Processed RIP-seq and translational efficiency data was downloaded from previous studies [36, [78] [79] [80] . Merging these values with half-lives resulted in missing values; these were imputed using the impute function of the imputeR R package (parameter lmFun="plsR") when considered alongside the "basic" continuous features describing mRNA properties.

Lasso regression modeling We trained a hybrid convolutional and recurrent deep neural network to predict half-life from its spliced mRNA sequence with several important gene structure annotations. Following previous work on applying such models to DNA/RNA sequence, we one-hot encoded the nucleotide sequence to four input tracks. Due to the strong influence of splicing on RNA stability, we added a fifth binary track to mark the positions of exon junctions at each exon's 5′ nucleotide. Due to the strong influence of mRNA codon composition, we added a sixth binary track to mark the beginning nucleotide of each codon, which implicitly labels the 5′ and 3′ UTRs due to their absence of codon markers. Although a nucleotide-only strategy may be preferable for mutation effect prediction, these mRNA features are important and could not easily be predicted by the model without adding substantial auxiliary training information.

Because mRNA lengths vary by orders of magnitude, we designed a model architecture to work with variable length sequences. We capped the length of an input mRNA to the model at 12,288 nt for practical purposes, finding that consideration of longer sequences led to equal performance at the cost of slower training speed. To accommodate the rare scenario in which an mRNA exceeded 12,288 nt, we truncated such cases from the 5′ end, restricting the model to access the 3′-most 12,288 nt. Conversely, for the common scenario in which an mRNA was shorter, we padded such cases with zeros at their 3′ ends, representing "N"s. We made use of the ten folds of human genes described above, and divided the mouse genes into ten folds that maintain the closest homologous genes in the same fold across species based on Ensembl annotations [99] .

Our model architecture, which we refer to as Saluki, consists of a "tower" of six convolution blocks to reach a resolution for which each position represents 128 nt ( To make a single numeric prediction for each sequence, we must aggregate information across the variable lengths. To achieve this, we use a common recurrent neural network block called the Gated Recurrent Unit (GRU) [102] . After layer normalization and ReLU activation, the GRU runs backward from the often-padded 3′ end to the information dense 5′ end. We take the final GRU hidden representation from the most 5′ position as a summary of the entire sequence. We apply a subsequent dense block, consisting of batch normalization, ReLU, and a dense layer.

Finally, we apply one more block of batch normalization, ReLU, and a final dense transformation to produce the half-life prediction.

We trained with the MSE loss function using the Adam optimizer on batches of 64 examples and learning rate 0.001, beta1 0.9, and beta2 0.98. We clipped gradients to a global norm of 0.5. We We trained models using both the human and mouse data using a published approach [5] , in which all parameters are shared except for two separate final dense blocks for human and mouse.

During training, we iterate between interwoven batches of human and mouse genes.

Our ten folds of genes allowed us to train multiple models, in which each fold was held out as a test set (and another was held out as a validation set 

We thank Han Yuan for help with TF-MoDISco analysis. We also thank Han Yuan, Divyanshi Srivastava, Nimrod Rubenstein, Charles Ledogar, and David Botstein for critical feedback.

Additional file 1: Figures S1-S7 . Supplementary figures S1 to S7.

Additional file 2: Dataset S1. Sparse matrices of human and mouse half-lives collected for each sample, provided along with the corresponding Ensembl gene IDs and gene names. Missing values exist for mRNAs whose half-lives were not measured or provided. Barplot of the number of genes whose half-lives were reported in each of 54 human samples. Each sample ID is delimited by underscores and listed according to its study of origin (Table 1) , measurement method, cell type, and replicate number. Figure S2 . Comparison of half-lives in a compendium of mouse datasets. a) Barplot of the number of genes whose half-lives were reported in each of 27 mouse samples. Each sample ID is delimited by underscores and listed according to its study of origin (Table 1) , measurement method, cell type, and replicate number. b). Heatmap of the absolute value of the Spearman correlations measured between half-lives derived from each pair of 27 mouse samples. Samples are clustered using hierarchical clustering according to the indicated dendrogram. Rows are labeled by the study of origin ( Table 1) and colored by the cell type of origin and measurement approach. Figure S4 . Prediction of mouse half-lives using sequence-encoded features. a-d) These panels are organized in the same fashion as Fig. 3a -d, except that they evaluate features benchmarked upon mouse data rather than human data. e) Performance of trained human and mouse models on test sets derived from either the same or different species. Due to the imperfect one-to-one miRNA mappings between the human and mouse, miRNA-related coefficients were excluded from the inter-species predictions. Each statistical comparison shown was evaluated with a paired t-test.

This figure is organized in the same fashion as Fig. 3-4 , except that it evaluates the subset of optimal genetic features (BC3MS model, Fig. 3) , optimal biochemical features (BEeM model, Fig. 4) , or a combination of both genetic and biochemical features. The features shown in panels (c-d) are described alongside their corresponding source, such as eCLIP, SeqWeaver, or codon features. Performance of trained Saluki models on each of 10 held-out folds of data. Compared is the relative performance between pairs of models, for both human and mouse species, which iteratively consider additional input tracks. Each model is described by a code indicating the input track considered. A description of the code is provided in the key. An improvement in a more complex model relative to a simpler model was evaluated with a one-sided, paired t-test, adjusted with a Bonferroni correction to account for the total number of hypothesis tests. c) This panel is organized in the same fashion as Fig. 5c , except that it evaluates the performance on mouse data rather than human data. d) Set of enriched motifs discovered by TF-MoDISco [83] in each of the three functional regions of an mRNA. Figure S7 . Concordance of Saluki predictions and additional functional data from massively parallel reporter assays. a-c) Scatter plot of the observed and predicted 3′-UTR effects, as measured in a) Jurkat cells or b) Beas2B cells, alongside c) a plot of observed 3′-UTR effects between the pair of cell types [87] . d-f) These panels are organized like panels (a-c), but display variant effects instead of 3′-UTR effects [87] . Panel e) is identical to that shown in Fig. 6e and reproduced here for convenience. Also indicated are the Pearson (r) and Spearman (rho) correlation values for panels (a-f). g) Scatter matrix displaying scatter plots corresponding to each of the 21 pairs of possible comparisons (lower diagonal elements) involving Saluki predictions as well as the measured stability effects for 3′-UTR fragments measured in each of six cell types [88] . Shown on the diagonal is a histogram of the predicted or observed RNA half-life scores. Also shown are Spearman correlation values among each pair of comparisons, with the size of the text proportional to the magnitude of the correlation coefficient (upper diagonal elements).

Sequential regulatory activity prediction across chromosomes with convolutional neural networks

Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Effective gene expression prediction from sequence by integrating long-range interactions

Cross-species regulatory sequence activity prediction

Database for mRNA half-life of 19 977 genes obtained by DNA microarray analysis of pluripotent and differentiating mouse embryonic stem cells

UTR-isoform choice has limited influence on the stability and translational efficiency of most mRNAs in mouse fibroblasts

Cis-regulatory elements explain most of the mRNA stability variation across genes in yeast

Most mammalian mRNAs are conserved targets of microRNAs

A high-resolution map of human evolutionary constraint using 29 mammals

Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals

The contribution of RNA decay quantitative trait loci to inter-individual variation in steady-state gene expression levels

Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs

CADD: predicting the deleteriousness of variants throughout the human genome

Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. bioRxiv

mRNA stability in mammalian cells

Transcriptional pulsing approaches for analysis of mRNA turnover in mammalian cells

A comparison of apparent mRNA half-life using kinetic labeling techniques vs decay following administration of transcriptional inhibitors

Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast

TU-tagging: cell type-specific RNA isolation from intact complex tissues

Comparative dynamic transcriptome analysis (cDTA) reveals mutual feedback between mRNA synthesis and degradation

GC content shapes mRNA storage and decay in human cells

Predicting effective microRNA target sites in mammalian mRNAs

Predicting microRNA targeting efficacy in Drosophila

Codon and amino acid content are associated with mRNA stability in mammalian cells

Coding regions affect mRNA stability in human cells

Codon optimality is a major determinant of mRNA stability

Translation affects mRNA stability in a codon-dependent manner in human cells

Codon bias confers stability to human mRNAs

mRNA structure regulates protein expression through changes in functional half-life

Widespread Influence of 3'-End Structures on Mammalian mRNA Processing and Stability

Human Pumilio proteins recruit multiple deadenylases to efficiently repress messenger RNAs

Structural basis for the recruitment of the human CCR4-NOT deadenylase complex by tristetraprolin

YTHDF2 destabilizes m6A-containing RNA through direct recruitment of the CCR4-NOT deadenylase complex [Internet]. Nature Communications

N6-methyladenosine-dependent regulation of messenger RNA stability

A Unified Model for the Function of YTHDF Proteins in Regulating m6A-Modified mRNA

Decay rates of human mRNAs: correlation with functional characteristics and sequence attributes

Non-invasive measurement of mRNA decay reveals translation initiation as the major determinant of mRNA stability

Characterizing RNA stability genome-wide through combined analysis of PRO-seq and RNA-seq data

DRUID: a pipeline for transcriptome-wide measurements of mRNA stability

Reversible methylation of m6Am in the 5' cap controls mRNA stability

Global donor and acceptor splicing site kinetics in human cells

Genome-wide survey of interindividual differences of RNA stability in human lymphoblastoid cell lines

Genome-wide determination of RNA stability reveals hundreds of short-lived noncoding transcripts in mammals

Metabolic labeling of RNA using multiple ribonucleoside analogs enables the simultaneous evaluation of RNA synthesis and degradation rates

TT-seq maps the human transient transcriptome

Differential protein occupancy profiling of the mRNA transcriptome

mRNA turnover rate limits siRNA and microRNA efficacy

Long-TUC-seq is a robust method for quantification of metabolically labeled full-length isoforms

Acetylation of Cytidine in mRNA

Chromatin environment, transcriptional regulation, and splicing distinguish lincRNAs and mRNAs

Sci-fate characterizes the dynamics of gene expression in single cells

TimeLapse-seq: adding a temporal dimension to RNA sequencing through nucleoside recoding

Conserved principles of mammalian transcriptional regulation revealed by RNA half-life

Thiol-linked alkylation of RNA to assess expression dynamics

The Dynamics of Cytoplasmic mRNA Metabolism

m6A mRNA modifications are deposited in nascent pre-mRNA and are not required for splicing but do specify cytoplasmic turnover

scSLAM-seq reveals core features of transcription dynamics in single cells

Stem cells. m6A mRNA methylation facilitates resolution of naïve pluripotency toward differentiation

CNOT3-Dependent mRNA Deadenylation Safeguards the Pluripotent State

High-resolution gene expression profiling for simultaneous kinetic parameter analysis of RNA synthesis and decay

Metabolic labeling of RNA uncovers principles of RNA production and degradation dynamics in mammalian cells

High-resolution sequencing and modeling identifies distinct dynamic RNA regulatory strategies

Global quantification of mammalian gene expression control

Systematic analysis of cis-elements in unstable mRNAs demonstrates that CUGBP1 is a key regulator of mRNA decay in muscle cells

Global analyses of the effect of different cellular contexts on microRNA targeting

Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk

Deep neural networks for interpreting RNA-binding protein target preferences

Deciphering human ribonucleoprotein regulatory networks

0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data

A large-scale binding and functional map of human RNA-binding proteins

Endoribonucleolytic Cleavage of m6A-Containing RNAs by RNase P/MRP Complex

A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation

A metabolic labeling method detects m6A transcriptome-wide at single base resolution

Perturbation of m6A writers reveals two distinct classes of mRNA methylation at internal and 5' sites

m6A RNA Modification Controls Cell Fate Transition in Mammalian Embryonic Stem Cells

Widespread RNA binding by chromatin-associated proteins

The influence of microRNAs and poly(A) tail length on endogenous mRNA-protein complexes

Genome Biology

N(6)-methyladenosine Modulates Messenger RNA Translation Efficiency

The landscape of alternative polyadenylation in single cells of the developing mouse embryo

Signals for pre-mRNA cleavage and polyadenylation

Technical Note on Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco)

Learning the regulatory code of the accessible genome with deep convolutional neural networks

MicroRNA targeting specificity in mammals: determinants beyond seed pairing

Massively parallel functional annotation of 3' untranslated regions

Massively parallel analysis of human 3′ UTRs reveals that AU-rich element length and registration predict mRNA destabilization

Oxford Academic

Genome-wide functional screen of 3′ UTR variants uncovers causal variants for human disease and evolution

The UCSC Genome Browser database: 2021 update

Genomic-scale measurement of mRNA turnover and the mechanisms of action of the anti-cancer drug flavopiridol

Genome-wide analysis of mRNA decay in resting and activated primary human T lymphocytes

mRNA vaccines for COVID-19: what, why and how

Emerging Issues in AAV-Mediated In Vivo Gene Therapy

Human 5′ UTR design and variant effect prediction from a massively parallel translation assay

A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences

A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation

Feedback GAN for DNA optimizes protein functions

MRNA stability and the control of gene expression: implications for human disease

BEDTools: a flexible suite of utilities for comparing genomic features

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling