RegTools: Integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer


1 

RegTools: Integrated analysis of genomic and transcriptomic data for the discovery of 

splicing variants in cancer 

 
Kelsy C. Cotto1,2,†, Yang-Yang Feng2,†, Avinash Ramu3, Zachary L. Skidmore1,2, Jason 
Kunisaki2, Megan Richters1,2, Sharon Freshour1,2, Yiing Lin4, William C. Chapman4, Ravindra 
Uppaluri5,6, Ramaswamy Govindan1,7, Obi L. Griffith1,2,3,7*, Malachi Griffith1,2,3,7* 
 
† denotes co-first authors. 
* denotes corresponding authors. 
 
Correspondence to Obi L. Griffith (obigriffith@wustl.edu) and Malachi Griffith 
(mgriffit@wustl.edu). 
 
Affiliations:  
1. Division of Oncology, Department of Medicine, Washington University School of Medicine, 

St. Louis, MO, USA 
2. McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, 

USA 
3. Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA 
4. Department of Surgery, Washington University School of Medicine, St. Louis, MO, USA 
5. Department of Surgery, Brigham and Women’s Hospital, Boston, MA, USA 
6. Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA 
7. Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA 
 
Abstract 
 
Somatic mutations in non-coding regions and even in exons may have unidentified regulatory 
consequences which are often overlooked in analysis workflows. Here we present RegTools 
(www.regtools.org), a free, open-source software package designed to integrate analysis of 
somatic variants from genomic data with splice junctions from transcriptomic data to identify 
variants that may cause aberrant splicing. RegTools was applied to over 9,000 tumor samples 
with both tumor DNA and RNA sequence data. We discovered 235,778 events where a variant 
significantly increased the splicing of a particular junction, across 158,200 unique variants and 
131,212 unique junctions. To characterize these somatic variants and their associated splice 
isoforms, we annotated them with the Variant Effect Predictor (VEP), SpliceAI, and Genotype-
Tissue Expression (GTEx) junction counts and compared our results to other tools that integrate 
genomic and transcriptomic data. While certain events can be identified by the aforementioned 
tools, the unbiased nature of RegTools has allowed us to identify novel splice variants and 
previously unreported patterns of splicing disruption in known cancer drivers, such as TP53, 
CDKN2A, and B2M, as well as in genes not previously considered cancer-relevant, such as 
RNF145.  
 
Introduction 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


2 

Alternative splicing of messenger RNA allows a single gene to encode multiple gene products, 
increasing a cell’s functional diversity and regulatory precision. However, splicing malfunction 
can lead to imbalances in transcriptional output or even the presence of novel oncogenic 
transcripts1. The interpretation of variants in cancer is frequently focused on direct protein-
coding alterations2. However, most somatic mutations arise in intronic and intergenic regions, 
and exonic mutations may also have unidentified regulatory consequences3,4,5,6. For example, 
mutations can affect splicing either in trans, by acting on splicing effectors, or in cis, by altering 
the splicing signals located on the transcripts themselves7.  
 
Increasingly, we are identifying the importance of splice variants in disease processes, including 
in cancer8,9. However, our understanding of the landscape of these variants is currently limited, 
and few tools exist for their discovery. One approach to elucidating the role of splice variants 
has been to predict the strength of putative splice sites in pre-mRNA from genomic sequences, 
such as the method used by the SpliceAI tool10–13. With the advent of efficient and affordable 
RNA-seq, we are also seeing the complementary approach of evaluating alternative splicing 
events (ASEs) directly from RNA sequencing data. Various tools exist which allow the 
identification of significant ASEs from transcript-level data within sample cohorts, including 
SUPPA2 and SPLADDER14,15. Many of these tools have also evaluated the role of trans-acting 
splice mutations16. However, few tools are directed at linking specific aberrant RNA splicing 
events to specific genomic variants in cis to investigate the splice regulatory impact of these 
variants. Those few relevant tools that do exist have significant limitations that preclude them 
from broad applications. The sQTL-based approach taken by LeafCutter and other tools is 
designed for relatively frequent single-nucleotide polymorphisms. It is thus ill-suited to studying 
somatic variants, or any case in which the frequency of a particular variant is very low (often 
unique) in a given sample population17–19. Recent tools that have been created for large-scale 
analysis of cancer-specific data, such as MiSplice and Veridical, ignore certain types of ASEs, 
are tailored to specific analysis strategies and sets of hypotheses, or are otherwise inaccessible 
to the end-user due to issues such as lack of documentation, difficulty with installation and 
integration with existing pipelines, limited computing efficiency,  or licensing issues20–22. To 
address these needs, we have developed RegTools, a free, open-source (MIT license) software 
package that is well-documented, modularized for ease of use, and designed to efficiently 
identify potential cis-acting splice-relevant variants in tumors (www.regtools.org).  
 
RegTools is a suite of tools designed to aid users in a broad range of splicing-related analyses. 
At the highest level, it contains three sub-modules: a variants module to annotate variant calls 
with respect to their potential splicing relevance, a junctions module to analyze aligned RNA-seq 
data and associated splicing events, and a cis-splice-effects module that integrates genomic 
variant calls and transcriptomic sequencing data to identify potential splice-altering variants. 
Each sub-module contains one or more commands, which can be used individually or integrated 
into regulatory variant analysis pipelines.  
 
To demonstrate the utility of RegTools in identifying potential splice-relevant variants from tumor 
data, we analyzed a combination of data available from the McDonnell Genome Institute (MGI) 
at Washington University School of Medicine and The Cancer Genome Atlas (TCGA) project. In 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


3 

total, we applied RegTools to 9,173 samples across 35 cancer types. We contrasted our results 
with other tools that integrate genomic and transcriptomic data to identify potential splice 
altering variants, specifically Veridical, MiSplice, and SAVNet20,21,23. Novel junctions identified by 
RegTools were compared to data from The Genotype-Tissue Expression (GTEx) project to 
assess whether these junctions are present in normal tissues24. Variants significantly associated 
with novel junctions were processed through VEP and Illumina’s SpliceAI tool to compare our 
findings with splicing consequences predicted based on the variant information alone13,25. With 
this additional analysis, we were able to more easily identify both variants in known cancer 
drivers, whose splicing consequences have not been previously reported in the literature, and 
potentially novel cancer drivers, whose disruption relies on splice-altering mutations 
 
Results 
 
The RegTools tool suite supports splice regulatory variant discovery by the integration 
of genome and transcriptome data. 
 
RegTools is a suite of tools designed to aid users in a broad range of splicing-related analyses. 
The variants module contains the annotate command. The variants annotate command takes a 
VCF of somatic variant calls and a GTF of transcriptome annotations as input. RegTools does 
not have any particular preference for variant callers or reference annotations. Each variant is 
annotated by RegTools with known overlapping genes and transcripts, and is categorized into 
one of several user-configurable “variant types”, based on position relative to the edges of 
known exons.  The variant type annotation depends on the stringency for splicing-relevance that 
the user sets with the “splice variant window” setting. By default, RegTools marks intronic 
variants within 2 bp of the exon edge as “splicing intronic”, exonic variants within 3 bp as 
“splicing exonic”, other intronic variants as “intronic”, and other exonic variants simply as 
“exonic.” RegTools considers only “splicing intronic” and “splicing exonic” as important. To allow 
for discovery of an arbitrarily expansive set of variants, RegTools allows the user to customize 
the size of the exonic/intronic windows individually (e.g. -i 50 -e 5 for intronic variants 50 bp from 
an exon edge and exonic variants 5 bp from an exon edge) or even consider all exonic/intronic 
variants as potentially splicing-relevant (e.g. -E or -I) (Figure 1A).  
 
The junctions module contains the extract and annotate commands. The junctions extract 
command takes an alignment file containing aligned RNA-seq reads, infers the exon-exon 
boundaries based on the CIGAR strings26, and outputs each “junction” as a feature in BED12 
format. The junctions annotate command takes a file of junctions in BED12 format (such as the 
one output by junctions extract), a FASTA file containing the reference genome, and a GTF file 
containing reference transcriptome annotations and generates a TSV file, annotating each 
junction with: the number of acceptor sites, donor sites, and exons skipped, and the identities of 
known overlapping transcripts and genes. We also annotate the “junction type”, which denotes if 
and how the junction is novel (i.e. different compared to provided transcript annotations). If the 
donor is known, but the acceptor is not or vice-versa, it is marked as “D” or “A”, respectively. If 
both are known, but the pairing is not known, it is marked as “NDA”, whereas if both are 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


4 

unknown, it is marked as “N”. If the junction is not novel (i.e. it appears in at least one transcript 
in the supplied GTF), it is marked as “DA” (Figure 1B).  
 
The cis-splice-effects module contains the identify command, which identifies potential splice-
altering variants from sequencing data. The following are required as input: a VCF file 
containing variant calls, an alignment file containing aligned RNA-sequencing reads, a reference 
genome FASTA file, and a reference transcriptome GTF file. The identify pipeline internally 
relies on variants annotate, junctions extract, and junctions annotate to output a TSV containing 
junctions proximal to putatively splicing-relevant variants. The identify pipeline can be 
customized using the same parameters as in the individual commands. Briefly, cis-splice-effects 
identify first performs variants annotate to determine the splicing-relevance of each variant in 
the input VCF. For each variant, a “splice junction region” is determined by finding the largest 
span of sequence space between the exons that flank the exon associated with the variant. 
From here, junctions extract identifies splicing junctions present in the RNA-seq BAM. Next, 
junctions annotate labels each extracted junction with information from the reference 
transcriptome as described above and its associated variants based on splice junction region 
overlap (Figure 1C). 
 
For our analysis, we annotated the pairs of associated variants and junctions identified by 
RegTools, which we refer to as “events”, with additional information such as whether this 
association was identified by a comparable tool, the junction was found in GTEx, and whether 
the event occurred in a cancer gene according to Cancer Gene Census (CGC) (Figure 1C)24,27. 
Finally, we created IGV sessions for each event identified by RegTools that contained a bed file 
with the junction, a VCF file with the variant, and an alignment (BAM) file for each sample that 
contained the variant28. These IGV sessions were used to manually review candidate events to 
assess whether the association between the variant and junction makes sense in a biological 
context. 
 
RegTools is designed for broad applicability and computational efficiency. By relying on well-
established standards for sequence alignments, annotation files, and variant calls and by 
remaining agnostic to downstream statistical methods and comparisons, our tool can be applied 
to a broad set of scientific queries and datasets. Moreover, performance tests show that cis-
splice-effects identify can process a typical candidate variant list of 1,500,000 variants and a 
corresponding RNA-seq BAM file of 82,807,868 reads in just ~8 minutes (Supplementary 
Figure 1). 
 
Pan-cancer analysis of 35 tumor types identifies somatic variants that alter canonical 
splicing 
 
RegTools was applied to 9,173 samples over 35 cancer types. 32 of these cohorts came from 
TCGA while the remaining three were obtained from other projects being conducted at MGI. 
Cohort sizes ranged from 21 to 1,022 samples. In total, 6,370,631 variants (Figure 2A) and 
2,387,989,201 junction observations (Figure 2B) were analyzed by RegTools. By comparing 
the number of initial variants per cohort to the number of statistically significant variants, we 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


5 

were able to show that RegTools produces a prioritized list of potential splice relevant variants 
(Supplementary Figure 2). Additionally, when analyzing the junctions within each sample, we 
found that junctions present in the reference transcriptome are frequently seen within GTEx 
data while junctions observed from a sample’s own transcriptome data that were not present in 
the reference are rarely seen within GTEx (Supplementary Figure 3). 235,778 significant 
variant junction pairings were found for junctions that use a known donor and novel acceptor 
(D), novel donor and known acceptor (A), or novel combination of a known donor and a known 
acceptor (NDA), with novel here meaning that the junction was not found in the reference 
transcriptome (Methods, Figure 2C, Supplemental Files 1 and 2). While our analysis primarily 
focuses on variants in relation to novel splice events because of the potential importance of 
these events within tumor processes, we also wanted to assess how often a variant was 
significantly associated with a known junction. 5,157 variant junction pairings were found for 
junctions known to the reference (DA junctions) (Supplemental Files 3 and 4). This finding 
indicates that while splice variants usually result in a novel junction occurring, they sometimes 
alter the expression of known junctions. Generally, significant events were evenly split among 
each of the novel junction types considered (D, A, and NDA). The number of significant events 
increased as the splice variant window size increased, with both the E and I results being 
comparable in number. Notably, hepatocellular carcinoma (HCC) was the only cohort that had 
whole genome sequencing (WGS) data available and, as expected, it exhibited a marked 
increase in the number of significant events for its results within the “I” splice variant window. 
This observation highlights the low sequence coverage of intronic regions that occurs with WES 
which subsequently leads to underpowered discovery of potential splice altering variants within 
introns. 
 
Variants were analyzed across tumor types for how often they result in either a single or multiple 
novel junctions (Figure 3A). While a single variant resulting in a single novel junction is most 
commonly observed (72.27-83.78%), a single variant also commonly results in multiple 
junctions being created, either of the same type (6.56-10.94%) or of different types (9.66-
16.79%) (Figure 3B). Variants that are associated with multiple novel junctions of different 
types were further investigated to identify how often a particular junction type occurred with 
another (Figure 3C). Most commonly, we observed an alternate donor or acceptor site being 
used in conjunction with an exon skipping event. These events were particularly common within 
the default window (2 intronic bases or 3 exonic bases from the exon edge), as a SNV or indel 
within these positions has a high probability of disrupting the natural splice site, thus causing the 
splicing machinery to use a cryptic splice site nearby or skip the splice site entirely. The next 
most common event was an alternate donor site and an alternate acceptor site both being used 
as the result of a single variant. The combination of a novel acceptor site and novel donor site 
being used in conjunction with an exon-skipping event occurred the least and occurrence of this 
type of event remains fairly low, even as the search space increases within the larger splice 
variant windows. This finding indicates the low likelihood of a single variant resulting in 
simultaneous disruption of a splice acceptor and donor as well as complete skipping of an exon. 
Overall, this analysis highlights that there is evidence that a single variant can lead to multiple 
novel junctions being expressed. Tools that only allow for a single junction to be predicted or 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


6 

associated with a variant therefore may not be completely describing the effect of the variant in 
question in up to ~27% of cases. 
 
RegTools identifies splice altering variants missed by other splice variant predictors and 
annotators 
 
To evaluate the performance of RegTools, we compared our results to those of SAVNet, 
MiSplice, Veridical, VEP, and SpliceAI13,20,21,23,25. These tools vary in their inputs and 
methodology for identifying splice altering variants (Figure 4A). Both VEP and SpliceAI only 
consider information about the variant and its genomic sequence context and do not consider 
information from a sample’s transcriptome. A variant is considered to be splice relevant 
according to VEP if it occurs within 1-3 bases on the exonic side or 1-8 bases on the intronic 
side of a splice site. SpliceAI does not have restrictions on where the variant can occur in 
relation to the splice site but by default, it predicts one new donor and acceptor site within 50 bp 
of the variant, based on reference transcript sequences from GENCODE. Like RegTools, 
SAVNet, MiSplice, and Veridical integrate genomic and transcriptomic data in order to identify 
splice altering variants. MiSplice only considers junctions that occur within 20 bp of the variant. 
Additionally, SAVNet, MiSplice, and Veridical filter out any transcripts found within the reference 
transcriptome. SAVNet, MiSplice, and Veridical employ different statistical methods for the 
identification of splice altering variants. In contrast to RegTools, none of the mentioned tools 
allow the user to set a custom window in which they wish to focus splice altering variant 
discovery (e.g. around the splice site, all exonic variants, etc.). These tools have different levels 
of code availability. MiSplice is available via GitHub as a collection of Perl scripts that are built to 
run via Load Sharing Facility (LSF) job scheduling. To run MiSplice without an LSF cluster, the 
authors mention code changes are required. Veridical is available via a subscription through 
CytoGnomix’s MutationForecaster. Similar to RegTools, SAVNet is available via GitHub or 
through a Docker image. However, SAVNet relies on splicing junction files generated by STAR29 
whereas RegTools can use RNA-Seq alignment files from HISAT230, TopHat231, or STAR, thus 
allowing it to be integrated into bioinformatic workflows more easily. 
 
In their recent publications, SAVNet23, MiSplice20, and Veridical21,22 also analyzed data from 
TCGA, with only minor differences in the number of samples included for each study. VEP and 
SpliceAI results were obtained by running each tool on all starting variants for the 35 cohorts 
included in this study. In order to efficiently compare this data, an UpSet plot (Figure 4B) was 
created32. Only 343 variants are identified as splice altering by all six tools. Comparatively, 
MiSplice and SAVNet find few splice altering variants, potentially indicating that these tools are 
overlooking the complete set of variants that have an effect on splicing. In contrast, Veridical 
identifies by far the most splice altering variants across all tools, with 94.54 percent of its calls 
being found by it alone. SpliceAI and VEP called a large number of variants, either alone or in 
agreement, that none of the tools that integrate transcriptomic data from samples identify. This 
highlights a limitation of using tools that only focus on genomic data, particularly in a disease 
context where transcripts are unlikely to have been annotated before. RegTools addresses 
these short-comings by identifying what pieces of information to extract from a sample’s 
genome and transcriptome in a very basic, unbiased way that allows for generalization. Other 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


7 

tools either only analyze genomic data, focus on junctions where either the canonical donor or 
acceptor site is affected (missing junctions that result from complete exon skipping), or consider 
only those variants within a very narrow distance from known splice sites. RegTools can include 
any kind of junction type, including exon-exon junctions that have ends that are not known 
donor/acceptor sites according to the GTF file (N junction according to RegTools), any distance 
size to make variant-junction associations, and any window size in which to consider variants. 
Due to these advantages, RegTools identified events missed by one or multiple of the tools to 
which we compared (Figure 4B; Supplementary Figures 4 and 5). 
 
Pan-cancer analysis reveals novel splicing patterns within known cancer genes and 
potential cancer drivers 
 
While efforts have been made to associate variants with specific cancer types, there has been 
little focus on identifying such associations in splice-altering variants, even those in known 
cancer genes. TP53 is a rare example whose splice-altering variants are well characterized in 
numerous cancer types33. As such, we further analyzed significant events to identify genes that 
had recurrent splice altering variants. Within each cohort, we looked for recurrent genes using 
two separate metrics: a binomial test p-value and the fraction of samples (see Methods). For 
ranking and selecting the most recurrent genes, each metric was computed by pooling across 
all cohorts. For assessing cancer-type specificity, each metric was then also computed using 
only results from a given cancer cohort. Since the mechanisms underlying the creation of novel 
junctions versus the disruption of existing splicing patterns may be different, analysis was 
performed separately for D/A/NDA junctions (Figure 5, Supplementary Figure 6, 
Supplementary File 5) and DA junctions (Supplementary Figure 7, Supplementary File 6), 
which allowed multiple test correction in accordance with the noise of the respective data. We 
identified 6,954 genes in which there was least one variant predicted to influence the splicing of 
a D/A/NDA junction. The 99th percentile of these genes, when ranked by either metric, are 
significantly enriched for known cancer genes, as annotated by the CGC (p=1.26E-19, ranked 
by binomial p-values, p=2.97E-24, ranked by fraction of samples; hypergeometric test). We also 
identified 3,643 genes in which there was least one variant predicted to influence the splicing of 
a DA (known) junction. The 99th percentile of these genes, when ranked by either metric, are 
also significantly enriched for known cancer genes, as annotated by the Cancer Gene Census 
(p=1.00E-04, ranked by binomial p-values, p=3.56E-07, ranked by fraction of samples; 
hypergeometric test). We also performed the same analyses using either the TCGA or MGI 
cohorts alone. The TCGA-only analyses gave very similar results to the combined analyses, 
with the 99th percentile of genes found in the D/A/NDA and DA analyses again being enriched 
for cancer genes (Supplementary Figures 8 and 9; Supplemental Files 5 and 6). Due to 
small cohort sizes, in the MGI-only analyses, we identified only 329 and 208 genes in the 
D/A/NDA and DA analyses, respectively. The 99th percentile of genes from these analyses, 
respectively, were not significantly enriched for cancer genes (Supplementary Figures 10 and 
11; Supplemental Files 5 and 6).  
 
When analyzing D, A, and NDA junctions, we saw an enrichment for known tumor suppressor 
genes among the most splice disrupted genes, including several examples where splice 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


8 

disruption is a known mechanism such as TP53, PTEN, CDKN2A, and RB1. Specifically, in the 
case of TP53, we identified 428 variants that were significantly associated with at least one 
novel splicing event. One such example is the intronic SNV (GRCh38, chr17:g.7673609C>A) 
that was identified in an OSCC sample and was associated with an exon skipping event and an 
alternate acceptor site usage event, with 23 and 41 reads of support, respectively 
(Supplemental Figure 12). The cancer types in which we find splice disruption of TP53 and 
other known cancer genes is in concordance with associations between genes and cancer 
types described by CGC and CHASMplus27,34. Our analysis’s recovery of known drivers, many 
of which with known susceptibilities to splicing dysregulation in cancer, indicates the ability of 
our method to identify true splicing effects that are likely cancer-relevant. 
 
Another cancer gene that we found to have a recurrence of splicing altering variants was B2M. 
Specifically, we identified six samples with intronic variants on either side of exon 2 (Figure 6). 
While mutations have been identified and studied within exon 2, we did not find literature that 
specifically identified intronic variants near exon 2 as a mechanism for disrupting B2M35. These 
mutations were identified by VEP to be either splice acceptor variant or a splice donor variant 
and were also identified by Veridical. MiSplice was able to predict one of the novel junctions for 
each variant but failed to predict additional novel junctions due to the limitation of that tool to 
only predict one novel acceptor and donor site per variant. Notably, 4 out of the 6 samples that 
these variants were found in are MSI-H (Microsatellite instability-high) tumors36. Mutations in 
B2M, particularly within colorectal MSI-H tumors, have been identified as a method for tumors to 
become incapable of HLA class I antigen-mediated presentation37. Furthermore, in a study of 
patients treated with immune checkpoint blockade (ICB) therapy, defects to B2M were observed 
in 29.4% of patients with progressing disease38. In the same study, B2M mutations were 
exclusively seen in pretreatment samples from patients who did not respond to ICB or in post-
progression samples after initial response to ICB38. There are several genes that are 
responsible for the processing, loading, and presentation of antigens, and have been shown to 
be mutated in cancers39. However, no proteins can be substituted for B2M in HLA class I 
presentation, thus making the loss of B2M a particularly robust method for ICB resistance40. We 
also observe exonic variants and variants further in intronic regions that disrupt canonical 
splicing of B2M. These findings indicate that intronic variants that result in alternative splice 
products within B2M may be a mechanism for immune escape within tumor samples. 
 
We also identify recurrent splice altering variants in genes not known to be cancer genes 
(according to CGC), such as RNF145. RegTools identified a recurrent single base pair deletion 
that results in an exon skipping event of exon 8 (Supplementary Figure 13). This gene is a 
paralog of RNF139, which has been found to be mutated in several cancer types41. This variant 
junction association was found in STAD, UCEC, COAD, and ESCA tumors, all of which are 
considered to be MSI-H tumors36. After analyzing the effect of the exon skipping event on the 
mRNA sequence, we concluded that the reading frame remains intact, possibly leading to a 
gain of function event. Additionally, the skipping of exon 8 leads to the removal of a 
transmembrane domain and a phosphorylation site, S352, which could be important for the 
regulation of this gene42. Based on these findings, RNF145 may play a role similar to RNF139 
and may be an important driver event in certain tumor samples. 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


9 

 
While most of our analysis focused on splice altering variants that resulted in D, A, NDA 
junctions, we also wanted to investigate variants that shifted the usage of known donor and 
acceptor sites. Through this analysis, we identified CDKN2A, a tumor suppressor gene that is 
frequently mutated in numerous cancers43, to have several variants that led to alternate donor 
usage (Supplementary Figure 14). When these variants are present, an alternate known donor 
site is used that leads to the formation of the transcript ENST00000579122.1 instead of 
ENST00000304494.9, the transcript that encodes for p16ink4a, a known tumor suppressor. The 
transcript that results from use of this alternate donor site is missing the last twenty-eight amino 
acids that form the C-terminal end of p16ink4a. Notably, this removes two phosphorylation sites 
within the p16 protein, S140 and S152, which when phosphorylated promotes the association of 
p16ink4a with CDK444. This finding highlights the importance of including known transcripts in 
alternative splicing analyses as variants may alter splice site usage in a way that results in a 
known but pathogenic transcript product. 
 
Discussion 
 
Splice associated variants are often overlooked in traditional genomic analysis. To address this 
limitation, we created RegTools, a software suite for the analysis of variants and junctions in a 
splicing context. By relying on well-established standards for analyzing genomic and 
transcriptomic data and allowing flexible analysis parameters, we enable users to apply 
RegTools to a wide set of scientific methodologies and datasets. To ease the use and 
integration of RegTools into analysis workflows, we provide documentation and example 
workflows via (regtools.org) and provide a Docker image with all necessary software installed.  
 
In order to demonstrate the utility of our tool, we applied RegTools to 9,173 tumor samples 
across 35 tumor types to profile the landscape of this category of variants. From this analysis, 
we report 133,987 variants that cause novel splicing events that were missed by VEP or 
SpliceAI. Only 1.4 percent of these mutations were previously discovered by similar attempts, 
while 98.6 percent are novel findings. We demonstrate that there are splice altering variants that 
occur beyond the splice site consensus sequence, shift transcript usage between known 
transcripts, and create novel exon-exon junctions that have not been previously described. 
Specifically, we describe notable findings within B2M, RNF145, and CDKN2A. These results 
demonstrate the utility of RegTools in discovering novel splice-altering mutations and confirm 
the importance of integrating RNA and DNA sequencing data in understanding the 
consequences of somatic mutations in cancer. To allow further investigation of these identified 
events, we make all of our annotated result files (Supplemental Files 1-4) and recurrence 
analysis files (Supplemental Files 5-6) available. 
 
Understanding the splicing landscape is crucial for unlocking potential therapeutic avenues in 
precision medicine and elucidating the basic mechanisms of splicing. The exploration of novel 
tumor-specific junctions will undoubtedly lead to translational applications, from discovering 
novel tumor drivers, diagnostic and prognostic biomarkers, and drug targets, to identifying a 
previously untapped source of neoantigens for personalized immunotherapy. While our analysis 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


10 

focuses on splice altering variants within cancers, we believe RegTools will play an important 
role in answering this broad range of questions by helping users extract splicing information 
from transcriptome data and linking it to somatic (or germline) variant calls. The computational 
efficiency of RegTools and increasing availability and size of such datasets may also allow for 
improved understanding of splice regulatory motifs that have proven difficult to accurately define 
such as exonic and intronic splicing enhancers and silencers. Any group with paired DNA and 
RNA-seq data for the same samples stands to benefit from the functionality of RegTools. 
 
Methods 
 
Software implementation 
 
RegTools is written in C++. CMake is used to build the executable from source code. We have 
designed the RegTools package to be self-contained in order to minimize external software 
dependencies. A Unix platform with a C++ compiler and CMake is the minimum prerequisite for 
installing RegTools. Documentation for RegTools is maintained as text files within the source 
repository to minimize divergence from the code. We have implemented common file handling 
tasks in RegTools with the help of open-source code from Samtools/HTSlib26 and BEDTools45 in 
an effort to ensure fast performance, consistent file handling, and interoperability with any 
aligner that adheres to the BAM specification. Statistical tests are conducted within RegTools 
using the RMath framework. Travis CI and Coveralls are used to automate and monitor software 
compilation and unit tests to ensure software functionality. We utilized the Google Test 
framework to write unit tests. 
 
RegTools consists of a core set of modules for variant annotation, junction extraction, junction 
annotation, and GTF utilities. Higher level modules such as cis-splice-effects make use of the 
lower level modules to perform more complex analyses. We hope that bioinformaticians familiar 
with C/C++ can re-use or adapt the RegTools code to implement similar tasks. 
 
 
Benchmarking 
 
Performance metrics were calculated for all RegTools commands. Each command was run with 
default parameters on a single blade server (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) 
with 10 GB of RAM and 10 replicates for each data point (Supplementary Figure 1). 
Specifically for cis-splice-effects identify, we started with random selections of somatic variants, 
ranging from 10,000-1,500,000, across 8 data subsets. Using the output from cis-splice-effects 
identify, variants annotate was run on somatic variants from the 8 subsets (range: 0-17,742) 
predicted to have a splicing consequence. The function junctions extract was performed on the 
HCC1395 tumor RNA-seq data aligned with HISAT to GRCh37 and randomly downsampled at 
intervals ranging from 10-100%. Using output from junctions extract, junctions annotate was 
performed for 7 data subsets ranging from 1,000-500,000 randomly selected junctions. 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


11 

Benchmark tests revealed an approximately linear performance for all functions. Variance 
between real and CPU time is highly dependent on the I/O speed of the write-disk and could 
account for artificially inflated real time values given multiple jobs writing to the same disk at 
once. The most computationally expensive function in a typical analysis workflow was junctions 
extract, which on average processed 33,091 reads/second (CPU) and took an average of 43.4 
real vs 41.7 CPU minutes to run on a full bam file (82,807,868 reads total). The function 
junctions annotate was the next most computationally intensive function and took an average of 
33.0 real/8.55 CPU minutes to run on 500,000 junctions, processing 975 junctions/second 
(CPU).  The other functions were comparatively faster with cis-splice-effects identify and 
variants annotate able to process 3,105 and 118 variants per second (CPU), respectively. To 
process a typical candidate variant list of 1,500,000 variants and a corresponding RNA-seq 
BAM file of 82,807,868 reads with cis-splice-effects identify takes ~ 8.20 real/8.05 CPU minutes 
(Supplementary Figure 1). 
 
Performance metrics were also calculated for the statistics script and its associated wrapper 
script that handles dividing the variants into smaller chunks for processing to limit RAM usage.  
This command, compare_junctions, was benchmarked in January 2020 using Amazon Web 
Services (AWS) on a m5.4xlarge instance, based on the Amazon Linux 2 AMI,  with 64 Gb of 
RAM, 16 vCPUs, and a mounted 1 TB SSD EBS volume with 3000 IOPS. These data were 
generated from running compare_junctions on each of the included cohorts, with the largest 
being our BRCA cohort (1022 sample) which processed 3.64 events per second (CPU). 
 
Using RegTools to identify cis-acting, splice altering variants 
RegTools contains three sub-modules: “variants”, “junctions”, and “cis-splice-effects”. For 
complete instructions on usage, including a detailed workflow for how to analyze cohorts using 
RegTools, please visit regtools.org.  
 
Variants annotate 
This command takes a list of variants in VCF format. The file should be gzipped and indexed 
with Tabix46. The user must also supply a GTF file that specifies the reference transcriptome 
used to annotate the variants.  
 
The INFO column of each line in the VCF is populated with comma-separated lists of the 
variant-overlapping genes, variant-overlapping transcripts, the distance between the variant and 
the associated exon edge for each transcript (i.e. each start or end of an exon whose splice 
variant window included the variant) defined as min(distance_from_start_of_exon, 
distance_from_end_of_exon), and the variant type for each transcript.  
 
Internally, this function relies on HTSlib to parse the VCF file and search for features in the GTF 
file which overlap the variant. The splice variant window size (i.e. the maximum distance from 
the edge of an exon used to consider a variant as splicing-relevant) can be set by the options “-
e <number of bases>” and “-i <number of bases>” for exonic and intronic variants, respectively. 
The variant type for each variant thus depends on the options used to set the splice variant 
window size. Variants captured by the window set by “-e” or “-i” are annotated as 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


12 

“splicing_exonic” and “splicing_intronic”, respectively. Alternatively, to analyze all exonic or 
intronic variants, the “-E” and “-I” options can be used. Otherwise, the “-E” and “-I” options 
themselves do not change the variant type annotation, and variants found in these windows are 
labeled simply as “exonic” or “intronic”. By default, single exon transcripts are ignored, but they 
can be included with the “-S” option. By default, output is written to STDOUT in VCF format. To 
write to a file, use the option “-o <PATH/TO/FILE>”. 
 
Junctions extract 
This command takes an alignment file containing aligned RNA-seq reads and infers junctions 
(i.e. exon-exon boundaries) based on skipped regions in alignments as determined by the 
CIGAR string operator codes. These junctions are written to STDOUT in BED12 format. 
Alternatively, the output can be redirected to a file with the “-o <PATH/TO/FILE>”. RegTools 
ascertains strand information based on the XS tags set by the aligner, but can also determine 
the inferred strand of transcription based on the BAM flags if a stranded library strategy was 
employed. In the latter case, the strand specificity of the library can be provided using “-s 
<INT>” where 0 = unstranded, 1 = first-strand/RF, 2 = second-strand/FR. We suggest that users 
align their RNA-seq data with HISAT230, TopHat231, or STAR29, as these are the aligners we 
have tested to date. If RNA-seq data is unstranded and aligned with STAR, users must run 
STAR with the --outSAMattributes option to include XS tags in the BAM output.  
 
Users can set thresholds for minimum anchor length and minimum/maximum intron length. The 
minimum anchor length determines how many contiguous, matched base pairs on either side of 
the junction are required to include it in the final output. The required overlap can be observed 
amongst separated reads, whose union determines the thickStart and thickEnd of the BED 
feature. By default, a junction must have 8 bp anchors on each side to be counted but this can 
be set using the option “-a <minimum anchor length>”. The intron length is simply the end 
coordinate of the junction minus the start coordinate. By default, the junction must be between 
70 bp and 500,000 bp, but the minimum and maximum can be set using “-i <minimum intron 
length>” and “-I <maximum intron length>”, respectively.  
 
For efficiency, this tool can be used to process only alignments in a particular region as 
opposed to analyzing the entire BAM file. The option “-r <chr>:<start>-<stop>” can be used to 
set a single contiguous region of interest. Multiple jobs can be run in parallel to analyze 
separate non-contiguous regions. 
 
Junctions annotate 
This command takes a list of junctions in BED12 format as input and annotates them with 
respect to a reference transcriptome in GTF format. The observed splice-sites used are 
recorded based on a reference genome sequence in FASTA format. The output is written to 
STDOUT in TSV format, with separate columns for the number of splicing acceptors skipped, 
number of splicing donors skipped, number of exons skipped, the junction type, whether the 
donor site is known, whether the acceptor site is known, whether this junction is known, the 
overlapping transcripts, and the overlapping genes, in addition to the chromosome, start, stop, 
junction name, junction score, and strand taken from the input BED12 file. This output can be 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


13 

redirected to a file with “-o /PATH/TO/FILE”. By default, single exon transcripts are ignored in 
the GTF but can be included with the option “-S”. 
 
Cis-splice-effects identify 
This command combines the above utilities into a pipeline for identifying variants which may 
cause aberrant splicing events by altering splicing motifs in cis. As such, it relies on essentially 
the same inputs: a gzipped and Tabix-indexed VCF file containing a list of variants, an 
alignment file containing aligned RNA-seq reads, a GTF file containing the reference 
transcriptome of interest, and a FASTA file containing the reference genome sequence of 
interest.  
 
First, the list of variants is annotated. The splice variant window size is set using the options “-
e”, “-i”, “-E”, and “-I”, just as in variants annotate. The splice junction region size (i.e. the range 
around a particular variant in which an overlapping junction is associated with the variant) can 
be set using “-w <splice junction region size>”. By default, this range is not a particular number 
of bases but is calculated individually for each variant, depending on the variant type annotation. 
For “splicing_exonic”, “splicing_intronic”, and “exonic” variants, the region extends from the 3’ 
end of the exon directly upstream of the variant-associated exon to the 5’ end of the exon 
directly downstream of it. For “intronic” variants, the region is limited to the intron containing the 
variant. Single-exons can be kept with the “-S” option. The annotated list of variants in VCF 
format (analogous to the output of variants annotate) can be written to a file with “-v 
/PATH/TO/FILE”. 
 
The BAM file is then processed in the splice junction regions to produce the list of junctions. A 
file containing these junctions in BED12 format (analogous to the output of junctions extract) 
can be written using “-j /PATH/TO/FILE”. The minimum anchor length, minimum intron length, 
and maximum intron length can be set using “-a”, “-i”, and “-I” options, just as in junctions 
extract.  
 
The list of junctions produced by the preceding step is then annotated with the information 
presented in junctions annotate. Additionally, each junction is annotated with a list of associated 
variants (i.e. variants whose splice junction regions overlapped the junction). The final output is 
written to STDOUT in TSV format (analogous to the output of junctions annotate) or can be 
redirected to a file with “-o /PATH/TO/FILE”. 
 
Cis-splice-effects associate 
This command is similar to cis-splice-effects identify, but takes the BED output of junctions 
extract in lieu of an alignment file with RNA alignments. As with cis-splice-effects identify, each 
junction is annotated with a list of associated variants (i.e. variants whose splice junction regions 
overlapped the junction). The resulting output is then the same as cis-splice-effects identify, but 
limited to the junctions provided as input.  
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


14 

Analysis 
Dataset Description 
32 cancer cohorts were analyzed from TCGA. These cancer types are Adrenocortical 
carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Brain Lower Grade Glioma (LGG), 
Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma and endocervical 
adenocarcinoma (CESC), Cholangiocarcinoma (CHOL), Colon adenocarcinoma (COAD), 
Esophageal carcinoma (ESCA), Glioblastoma multiforme (GBM), Head and Neck squamous cell 
carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney renal clear cell carcinoma (KIRC), 
Kidney renal papillary cell carcinoma (KIRP), Liver hepatocellular carcinoma (LIHC), Lung 
adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Lymphoid Neoplasm Diffuse 
Large B cell Lymphoma (DLBC), Mesothelioma (MESO), Ovarian serous cystadenocarcinoma 
(OV), Pancreatic adenocarcinoma (PAAD), Pheochromocytoma and Paraganglioma (PCPG), 
Prostate adenocarcinoma (PRAD), Rectum adenocarcinoma (READ), Sarcoma (SARC), Skin 
Cutaneous Melanoma (SKCM), Stomach adenocarcinoma (STAD), Testicular Germ Cell 
Tumors (TGCT), Thymoma (THYM), Thyroid carcinoma (THCA), Uterine Carcinosarcoma 
(UCS), Uterine Corpus Endometrial Carcinoma (UCEC), and Uveal Melanoma (UVM). Three 
cohorts were derived from patients at Washington University in St. Louis. These cohorts are 
Hepatocellular Carcinoma (HCC), Oral Squamous Cell Carcinoma (OSCC), and Small Cell 
Lung Cancer (SCLC).  
 
Sample processing 
We applied RegTools to 35 tumor cohorts. Genomic and transcriptomic data for 32 cohorts were 
obtained from The Cancer Genome Atlas (TCGA). Information regarding the alignment and 
variant calling for these samples is described by the Genomic Data Commons data 
harmonization effort47. Whole exome sequencing (WES) mutation calls for these samples from 
MuSE48, MuTect249, VarScan250, and SomaticSniper51, were left-aligned, trimmed, and 
decomposed to ensure the correct representation of the variants across the multiple callers. 
 
Samples for the remaining three cohorts, HCC, SCLC, and OSCC, were sequenced at 
Washington University in St. Louis. Genomic data were produced by WES for SCLC and OSCC 
and whole genome sequencing (WGS) for HCC. Normal genomic data of the same sequencing 
type and tumor RNA-seq data were also available for all subjects. Sequence data were aligned 
using the Genome Modeling System (GMS)52 using TopHat2 for RNA and BWA-MEM53 for 
DNA. HCC and SCLC were aligned to GRCh37 while OSCC was aligned to GRCh38. Somatic 
variant calls were made using Samtools v0.1.126, SomaticSniper2 v1.0.251, Strelka V0.4.6.254, 
and VarScan v2.2.650,54 through the GMS. High-quality mutations for all samples were then 
selected by requiring that a variant be called by two of the four variant callers.  
 
Candidate junction filtering 
To generate results for 4 splice variant window sizes, we ran cis-splice-effects identify with 4 
sets of splice variant window parameters. For our “i2e3” window (RegTools default), to examine 
intronic variants within 2 bases and exonic variants within 3 bases of the exon edge, we set “-i 2 
-e 3”. Similarly, for “i50e5”, to examine intronic variants within 50 bases and exonic variants 
within 5 bases of the exon edge, we set “-i 50 -e 5”. To view all exonic variants, we simply set “-

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


15 

E”, without “-i” or “-e” options. To view all intronic variants, we simply set “-I”, without “-i” or “-e” 
options. TCGA samples were processed with GRCh38.d1.vd1.fa (downloaded from the GDC 
reference file page at https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-
files) as the reference fasta file and gencode.v29.annotation.gtf (downloaded via the GENCODE 
FTP) as the reference transcriptome. OSCC was processed with 
Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa and Homo_sapiens.GRCh38.79.gtf 
(both downloaded from Ensembl). HCC and SCLC were processed with 
Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa and Homo_sapiens.GRCh37.87.gtf 
(both downloaded from Ensembl). 
 
Statistical filtering of candidate events 

We refer to a statistical association between a variant and a junction as an “event”. For each 

event identified by RegTools, a normalized score (norm_score) was calculated for the junction 

of the event by dividing the number of reads supporting that junction by the sum of all reads for 

all junctions within the splice junction region for the variant of interest. This metric is 

conceptually similar to a “percent-spliced in” (PSI) index, but measures the presence of entire 

exon-exon junctions, instead of just the inclusion of individual exons. If there were multiple 

samples that contained the variant for the event, then the mean of the normalized scores for the 

samples was computed (mean_norm_score). If only one sample contained the variant, its 

mean_norm_score was thus equal to its norm_score. This value was then compared to the 

distribution of samples which did not contain the variant to calculate a p-value as the percentage 

of the norm_scores from these samples which are at least as high as the mean_norm_score 

computed for the variant-containing samples. We performed separate analyses for events 

involving canonical junctions (DA) and those involving novel junctions which used at least one 

known splice site (D/A/NDA), based on annotations in the corresponding reference GTF. For 

this study, we filtered out any junctions which did not use at least one known splice site (N) and 

junctions which did not have at least 5 reads of evidence across variant-containing samples. 

The Benjamini-Hochberg procedure was then applied to the remaining events. Following 

correction, an event was considered significant if its adjusted p-value was ≤ 0.05. 

 
Annotation with GTEx junction data and other splice prediction tools 
Events identified by RegTools as significant were annotated with information from GTEx, VEP, 
SpliceAI, MiSplice, and Veridical. GTEx junction information was obtained from the GTEx Portal. 
Specifically, the exon-exon junction read counts file from the v8 release was used for data 
aligned to GRCh38 while the same file from the v7 release was used for the data aligned to 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


16 

GRCh37. Mappings between tumor cohorts and GTEx tissues can be found in Supplemental 
File 7.  We annotated all starting variants with VEP in the “per_gene” and “pick” modes. The 
“per_gene” setting outputs only the most severe consequence per gene while the “pick” setting 
picks one line or block of consequence data per variant. We considered any variant with at least 
one splicing-related annotation to be “VEP significant”. All variants were also processed with 
SpliceAI using the default options. A variant was considered to be “SpliceAI significant” if it had 
at least one score greater than 0.2, the developers’ value for high recall of their model. Variants 
identified by MiSplice20 were obtained from the paper supplemental tables and were lifted over 
to GRCh38. Variants identified by SAVNet23 were obtained from the paper supplemental tables 
and were lifted over to GRCh38. Variants identified by Veridical21,22 were obtained via download 
from the link reference within the manuscript and lifted over to GRCh38. 
 
Visual exploration of statistically significant candidate events 
IGV sessions were created for each event identified by RegTools that was statistically 
significant. Each IGV session file contained a bed file with the junction, a vcf file with the variant, 
and an alignment file for each sample that contained the variant. Additional information, such as 
the splice sites predicted by SpliceAI, were also added to these session files to enhance the 
exploration of these events. Events of interest were manually reviewed in IGV to assess 
whether the association between the variant and junction made sense in a biological context 
(e.g. affected a known splice site, altered a genomic sequence to look more like a canonical 
splice site, or the novel junction disrupted active or regulatory domains of the protein product). 
An extensive review of literature and visualizations of junction usage in the presence and 
absence of the variant were also used to identify novel, biologically relevant events. 
 
 
Identification of genes with recurrent splice altering variants 
For each cohort, we calculated a p-value to assess whether the splicing profile from a particular 
gene was significantly more likely to be altered by somatic variants. Specifically, we performed a 
1-tailed binomial test, considering the number of samples in a cohort as the number of attempts. 
Success was defined by whether the sample had evidence of at least one splice-altering variant 
in that gene. The null probability of success, pnull was calculated as 
 

where s is the total number of base positions residing in any of the gene’s splice variant 
windows, V is the event that a somatic variant occurred at such a base position, and A is the 
event that this variant was deemed to be significantly associated with at least one junction in our 
analysis. The joint probability that both V and A occurred was estimated by dividing the total of 
events across all samples in which each junction was detected by s. The value of s was 
computed based on the exon and transcript definitions in the reference GTF used for performing 
RegTools analyses on a given cohort.  
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


17 

We also calculated overall metrics, in order to rank genes. For each set of cohorts (e.g. TCGA-
only, MGI-only, combined), an overall p-value was computed for each gene according to the 
above formula, pooling all of the samples across the included cohorts, and the fraction of 
samples was simply calculated by dividing the number of samples in which an event occurred 
within the given gene by the total number of samples, pooled across the included cohorts. The 
reference GTF used for analyzing the TCGA samples (i.e. gencode.v29.annotation.gtf) was 
used for all sets of cohorts. 
 
Code availability 
RegTools is open source (MIT license) and available at https://github.com/griffithlab/regtools/. 
All scripts used in the analyses presented here are also provided. For ease of use, a Docker 
container has been created with RegTools, R, and Python 3 installed 
(https://hub.docker.com/r/griffithlab/regtools/). This Docker container allows a user to run the 
workflow we outline at https://regtools.readthedocs.io/en/latest/workflow/. Docker is an open-
source software platform that enables applications to be readily installed and run on any 
system. The availability of RegTools with all its dependencies as a Docker container also 
facilitates the integration of the RegTools software into workflow pipelines that support Docker 
images. 
 
Data availability 
Sequence data for each cohort analyzed in this study are available through dbGaP at the 
following accession IDs: phs000178 for TCGA cohorts, phs001106 for HCC, phs001049 for 
SCLC, and phs001623 for OSCC. Statistically significant events for D, A, and NDA junctions 
across the four variant splicing windows used are available via Supplemental Files 1 and 2.  
Statistically significant events for DA junctions are available as Supplemental Files 3 and 4. 
Complete results of gene recurrence analysis are available as Supplemental Files 5 and 6. 
 

Acknowledgments 
We thank the patients and their families for donation of their samples and participation in clinical 
trials. We would like to thank Donald Conrad for his initial idea to compare to variant effect 
predictor tools. Kelsy Cotto was supported by Siteman Cancer Center under fund number 
#3477-92400 and T32CA113275. Avinash Ramu was supported by the ‘Burroughs Wellcome 
Fund Institutional Program Unifying Population and Laboratory Based Sciences Award’ at 
Washington University. Malachi Griffith was supported by the National Human Genome 
Research Institute (NHGRI) of the National Institutes of Health (NIH) under Award Number 
R00HG007940. Malachi Griffith and Obi Griffith were supported by the NIH National Cancer 
Institute (NCI) under Award Numbers U01CA209936, U01CA231844, U01CA248235 
U24CA237719. Malachi Griffith and Megan Richters were supported by the V Foundation for 
Cancer Research under Award Number V2018-007. The results published here are in whole or 
part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. 
 
Contributions 
K.C.C. and Y.-Y.F. were involved in all aspects of this study, including designing methodology, 
developing and testing the tool software, analyzing and interpreting data, and writing the 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


18 

manuscript, with input from A.R., Z.L.S., M.R., S.F., J.K., O.L.G., and M.G. A.R. designed the 
tool and led software development efforts. Y.L., W.C.C., R.U., and R.G. provided unpublished 
tumor datasets and provided critical feedback on the manuscript. O.L.G. and M.G. supervised 
the study. All authors read and approved the final manuscript. 
 
Conflicts of Interest 
W. Chapman serves on the advisory board for Novartis Pharmaceutical and reports intellectual 
property with Pathfinder Therapeutics. R. Uppaluri reports grants and personal fees from Merck 
Inc. R. Govindan served as consultant for Horizon Pharmaceuticals and GenePlus.  
 
References 

1. Chabot, B. & Shkreta, L. Defective control of pre-messenger RNA splicing in human 

disease. J. Cell Biol. 212, 13–27 (2016). 

2. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546–1558 (2013). 

3. Soemedi, R. et al. Pathogenic variants that alter protein code often disrupt splicing. Nat. 

Genet. 49, 848–855 (2017). 

4. Supek, F., Miñana, B., Valcárcel, J., Gabaldón, T. & Lehner, B. Synonymous mutations 

frequently act as driver mutations in human cancers. Cell 156, 1324–1335 (2014). 

5. Jung, H. et al. Intron retention is a widespread mechanism of tumor-suppressor 

inactivation. Nat. Genet. 47, 1242–1248 (2015). 

6. Venables, J. P. Aberrant and alternative splicing in cancer. Cancer Res. 64, 7647–7654 

(2004). 

7. Climente-González, H., Porta-Pardo, E., Godzik, A. & Eyras, E. The Functional Impact of 

Alternative Splicing in Cancer. Cell Rep. 20, 2215–2226 (2017). 

8. Chen, J. & Weiss, W. A. Alternative splicing in cancer: implications for biology and therapy. 

Oncogene 34, 1–14 (2015). 

9. Xiong, H. Y. et al. RNA splicing. The human splicing code reveals new insights into the 

genetic determinants of disease. Science 347, 1254806 (2015). 

10. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with 

applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004). 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


19 

11. Fairbrother, W. G., Yeh, R.-F., Sharp, P. A. & Burge, C. B. Predictive identification of exonic 

splicing enhancers in human genes. Science 297, 1007–1013 (2002). 

12. Wang, Z. et al. Systematic identification and analysis of exonic splicing silencers. Cell 119, 

831–845 (2004). 

13. Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 

176, 535–548.e24 (2019). 

14. Kahles, A., Ong, C. S., Zhong, Y. & Rätsch, G. SplAdder: identification, quantification and 

testing of alternative splicing events from RNA-Seq data. Bioinformatics 32, 1840–1847 

(2016). 

15. Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing 

analysis across multiple conditions. Genome Biol. 19, 40 (2018). 

16. Kahles, A. et al. Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 

Patients. Cancer Cell 34, 211–224.e6 (2018). 

17. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 

50, 151–158 (2018). 

18. Monlong, J., Calvo, M., Ferreira, P. G. & Guigó, R. Identification of genetic variants 

associated with alternative splicing using sQTLseekeR. Nat. Commun. 5, 4698 (2014). 

19. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 

352, 600–604 (2016). 

20. Jayasinghe, R. G. et al. Systematic Analysis of Splice-Site-Creating Mutations in Cancer. 

Cell Rep. 23, 270–281.e3 (2018). 

21. Viner, C., Dorman, S. N., Shirley, B. C. & Rogan, P. K. Validation of predicted mRNA 

splicing mutations using high-throughput transcriptome data. F1000Res. 3, (2014). 

22. Shirley, B. C., Mucaki, E. J. & Rogan, P. K. Pan-cancer repository of validated natural and 

cryptic mRNA splicing mutations. F1000Res. 7, 1908 (2018). 

23. Shiraishi, Y. et al. A comprehensive characterization of cis-acting splicing-associated 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


20 

variants in human cancer. Genome Res. 28, 1111–1125 (2018). 

24. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–

585 (2013). 

25. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016). 

26. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–

2079 (2009). 

27. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction 

across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018). 

28. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011). 

29. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 

30. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory 

requirements. Nat. Methods 12, 357–360 (2015). 

31. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, 

deletions and gene fusions. Genome Biol. 14, R36 (2013). 

32. Conway, J. R., Lex, A. & Gehlenborg, N. UpSetR: an R package for the visualization of 

intersecting sets and their properties. Bioinformatics 33, 2938–2940 (2017). 

33. Surget, S., Khoury, M. P. & Bourdon, J.-C. Uncovering the role of p53 splice variants in 

human malignancy: a clinical perspective. Onco. Targets. Ther. 7, 57–68 (2013). 

34. Tokheim, C. & Karchin, R. CHASMplus Reveals the Scope of Somatic Missense Mutations 

Driving Human Cancers. Cell Syst 9, 9–23.e8 (2019). 

35. Bicknell, D. C., Kaklamanis, L., Hampson, R., Bodmer, W. F. & Karran, P. Selection for β2-

microglobulin mutation in mismatch repair-defective colorectal carcinomas. Curr. Biol. 6, 

1695–1697 (1996). 

36. Bonneville, R. et al. Landscape of Microsatellite Instability Across 39 Cancer Types. JCO 

Precis Oncol 2017, (2017). 

37. Kloor, M. et al. Immunoselective pressure and human leukocyte antigen class I antigen 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


21 

machinery defects in microsatellite unstable colorectal cancers. Cancer Res. 65, 6418–

6424 (2005). 

38. Sade-Feldman, M. et al. Resistance to checkpoint blockade therapy through inactivation of 

antigen presentation. Nat. Commun. 8, 1136 (2017). 

39. Seliger, B., Maeurer, M. J. & Ferrone, S. Antigen-processing machinery breakdown and 

tumor growth. Immunol. Today 21, 455–464 (2000). 

40. Güssow, D. et al. The human beta 2-microglobulin gene. Primary structure and definition of 

the transcriptional unit. J. Immunol. 139, 3132–3138 (1987). 

41. Wang, L., Yin, W. & Shi, C. E3 ubiquitin ligase, RNF139, inhibits the progression of tongue 

cancer. BMC Cancer 17, 452 (2017). 

42. Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic 

Acids Res. 43, D512–20 (2015). 

43. Zhao, R., Choi, B. Y., Lee, M.-H., Bode, A. M. & Dong, Z. Implications of Genetic and 

Epigenetic Alterations of CDKN2A (p16(INK4a)) in Cancer. EBioMedicine 8, 30–39 (2016). 

44. Gump, J., Stokoe, D. & McCormick, F. Phosphorylation of p16 INK4A Correlates with Cdk4 

Association. J. Biol. Chem. 278, 6619–6622 (2003). 

45. Quinlan, A. R. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr. 

Protoc. Bioinformatics 47, 11.12.1–34 (2014). 

46. Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. 

Bioinformatics 27, 718–719 (2011). 

47. GDC Data Processing. https://gdc.cancer.gov/about-data/gdc-data-processing. 

48. Fan, Y. et al. Accounting for tumor heterogeneity using a sample-specific error model 

improves sensitivity and specificity in mutation calling for sequencing data. bioRxiv 055467 

(2016) doi:10.1101/055467. 

49. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and 

heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


22 

50. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in 

cancer by exome sequencing. Genome Res. 22, 568–576 (2012). 

51. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole 

genome sequencing data. Bioinformatics 28, 311–317 (2012). 

52. Griffith, M. et al. Genome Modeling System: A Knowledge Management Platform for 

Genomics. PLoS Comput. Biol. 11, e1004274 (2015). 

53. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler 

transform. Bioinformatics 25, 1754–1760 (2009). 

54. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced 

tumor–normal sample pairs. Bioinformatics 28, 1811–1817 (2012). 

  
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


23 

Main Figures 

 
Figure 1: Flexible, streamlined discovery of cis-acting splice variants with 
RegTools modules and cis-splice-effects identify workflow. 
A) By default, variants annotate marks variants within 3bp on the exonic side and 2bp on the 
intronic side of an exon edge as potentially splicing-relevant. This “splice variant window” can be 
modified individually for the exonic side and intronic side using the “-e” and “-i” options, 
respectively. With cis-splice-effects identify, for each variant in the splice variant window, a 
“splice junction region” is determined by finding the largest span of sequence space between 
exons which flank the exon associated with the splicing-relevant variant. The splice junction 
region can also be set manually to contain the entire sequence space n bases upstream and 
downstream of the variant using the “-w” option. Junctions overlapping the splice junction region 
are associated with the variant. Using the -E option considers all exonic variants as potentially 
splicing-relevant, but is otherwise the same. The -I option considers all intronic variants and also 
limits the splice junction region to the intronic region in which the variant is found, excluding the 
flanking exons. B) Cis-splice-effects identify and the underlying junctions annotate command 
annotate splicing events based on whether the donor and acceptor site combination is found in 
the reference transcriptome GTF. In this example, there are two known transcripts (shown in 
blue) which overlap a set of junctions from RNAseq data (depicted as junction supporting reads 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


24 

in red). Comparing the observed junctions to the reference junctions in the first transcript (top 
panel), RegTools checks to see if the observed donor and acceptor splice sites are found in any 
of the reference exons and also counts the number of exons, acceptors, and donors skipped by 
a particular junction. Double arrows represent matches between observed and reference 
acceptor/donor sites while single arrows show novel splice sites. These steps are repeated for 
the rest of the relevant transcripts, keeping track of whether there are known acceptor-donor 
combinations. Junctions with a known donor but novel acceptor or vice-versa are annotated as 
“D” or “A”, respectively. If both sites are known but do not appear in combination in any 
transcripts, the junction is annotated as “NDA”, whereas if both sites are unknown, the junction 
is annotated as “N”. If the junction is known to the reference GTF, it is marked as “DA”. C) The 
cis-splice-effects identify command relies on the variants annotate, junctions extract, and 
junctions annotate submodules. This pipeline takes variant calls and RNA-seq alignments along 
with genome and transcriptome references and outputs information about novel junctions and 
associated potential cis splice-altering sequence variants. RegTools is agnostic to downstream 
research goals and its output can be filtered through user-specific methods and thus can be 
applied to a broad set of scientific questions. 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


25 

 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


26 

 
Figure 2. Overview of input data considered and significant events identified by 
RegTools for each tumor type. 
A) Summary of initial variants considered for analysis by RegTools per sample per tumor 
cohort. Each sample’s variant count is plotted and violin plots are overlaid for each cohort. B) 
Summary unique exon-exon junction observations for each sample. Each sample’s unique 
junction count is plotted and violin plots are overlaid for each cohort. C) Summary of significant 
junction types for each cohort across each of the variant window sizes that were used in this 
analysis.  
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


27 

 
Figure 3. Splice regulatory variants often lead to the expression of multiple 
alternative junctions. 
A)  A single variant can result in either one or more than one alternatively spliced junctions. 
Depicted is a variant resulting in a single novel transcript product (purple), a variant resulting in 
two novel transcript products that both use alternate donor sites (yellow), and a variant resulting 
in multiple junctions of different types (teal). B) Stacked bar graph visualizing how often a 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


28 

variant leads to each of the categories mentioned above across the four RegTools variant 
windows used. This analysis is for all variants that RegTools identified as significant. C) Bar 
chart showing how often each of the described junction combinations occurs when a single 
variant results in multiple junction types across each of the RegTools splice variant windows 
used. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


29 

 
Figure 4. Comparison of RegTools with other tools that identify potential splice 
altering variants. 
A) Conceptual diagram of contrasting approaches used to identify splice regulatory 
tools/methods. A red dot indicates that the source only considers genomic data for making its 
calls, as opposed to a combination of genomic and transcriptomic data. B) UpSet plot 
comparing splice altering variants identified by RegTools to those identified by other splice 
variant predictors and annotators. Each tool and their total number of variant predictions are 
shown on the left side bar graph. The numbers of variants specific to each tool or shared 
between different combinations of tools are indicated by the bar graph along the top and the 
individual or connected dots. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


30 

 
Figure 5. Pan-cancer analysis of cohorts from TCGA and MGI reveals genes 
recurrently disrupted by variants which cause non-canonical splicing patterns 
Results of analysis for recurrently disrupted genes in each cohort. Columns correspond 
to the 20 most frequently recurring genes, as ranked by fraction of samples. Genes are 
clustered by whether they were annotated by the CGC as an oncogene (red), an 
oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


31 

another type of cancer-relevant gene. Shading corresponds to −log10(p value) and 

columns represent cancer types. Red marks within cells indicate that the gene was 

annotated by CHASMplus as a driver within a given TCGA cohort.  

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


32 

 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


33 

 
Figure 6. Several SNVs in B2M associated with alternate acceptor and alternate 
donor usage. 
 
A) IGV snapshot of three intronic variant positions found to be associated with usage of an 
alternate acceptor and alternative donor site that leads to formation of novel transcript products. 
This result was found using the default splice variant window parameter (i2e3). B) Zoomed in 
view of the variants identified by RegTools that are associated with alternate acceptor and 
donor usage. Two of these variant positions flank the acceptor site and one flanks the donor site 
that are being affected. C) Sashimi plot visualizations for samples containing the identified 
variants that show alternate acceptor usage (red) or alternate donor usage (orange). 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


34

Supplemental Figures 

 
Supplementary Figure 1. Benchmarking of each RegTools command. 
The total CPU time (System Time + User Time) and real time are plotted against the number of 
entries processed for each available RegTools function using 10 total replicates. For the cis-
splice-effects identify/cis-splice-effects associate/variants annotate workflows, the number of 
entries corresponds to the number of somatic variants, whereas the number of entries in the 
junctions extract/junctions annotate/compare_junctions workflows corresponds to the number of 
reads processed from a downsampled BAM file, the number of junctions processed, and the 
number of candidate variant junction pairings processed, respectively. For compare_junctions, 
candidate variant junction pairings were compared across the number of samples in that cohort, 
with the largest being 1022 samples that comprise our BRCA cohort. LOESS curves are fitted 
onto each plot. 
 

34 

 
of 

rt, 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


35 

 
Supplementary Figure 2. Summary of variants analyzed by RegTools in each 
tumor cohort 
Summary of the starting number of high quality variants per sample, the number of initial 
variants considered for analysis by RegTools for each variant window used per tumor cohort, 
and the number of significant variants for each variant window used per tumor cohort. 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


36 

 
Supplementary Figure 3. Visualization of junctions across cohorts. 
Summary of the total junction read counts, unique junctions (all types), unique known (DA) 
junctions, unique known (DA) junctions not found in GTEx, unique D, A, NDA junctions, and 
unique D, A, NDA junctions not found in GTEx per sample per cohort. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


37 

 
Supplementary Figure 4: Intronic SNV in CTTN associated with an exon skipping 
event. 
 
A) IGV snapshot of a single nucleotide variant (GRCh38, chr11:g.70407517G>C) within an 
intron of CTTN in LUAD sample TCGA-86-6851-01A. This variant is associated with an exon 
skipping event causing the formation of an NDA junction, JUNC00027688, which has 44 reads 
of support. The variant was identified by RegTools, VEP, and Veridical but no other tools. This 
result was found using the default splice variant window parameter (i2e3). B) Sashimi plot 
visualization of the novel junction. 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


38 

 
Supplementary Figure 5: Exonic SNV in LZTR1 associated with alternative donor 
usage. 
 
A) IGV snapshot of a single nucleotide variant (GRCh38, chr22:g.20995026G>C) within an 
exon of LZTR1 in LUAD sample TCGA-38-4631-01A. This variant is associated with the 
formation of an A junction, JUNC00075013, which has 49 reads of support. The variant was 
identified by RegTools, VEP, and SpliceAI but no other tools. This result was found using the 
default splice variant window parameter (i2e3). B) Sashimi plot visualization of the novel 
junction. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


39 

 
Supplementary Figure 6. Pan-cancer analysis of cohorts from TCGA and MGI 
reveals genes recurrently disrupted by variants which cause non-canonical 
splicing patterns 
Results of analysis for recurrently disrupted genes in each cohort. A) Rows correspond 
to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are 
clustered by whether they were annotated by the CGC as an oncogene (red), an 
oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or 

another type of cancer-relevant gene. Shading corresponds to −log10(p value) and 

columns represent cancer types. Red marks within cells indicate that the gene was 

annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond 

to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading 
corresponds to the fraction of samples and columns represent cancer types. Red marks 
within cells indicate that the gene was annotated by CHASMplus as a driver within a 
given TCGA cohort. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


40 

 
Supplementary Figure 7. Pan-cancer analysis of cohorts from TCGA and MGI 
reveals genes recurrently disrupted by variants which promote splicing of 
particular canonical junctions 

Results of analysis for recurrently disrupted genes in each cohort. A) Rows correspond 

to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are 

clustered by whether they were annotated by the CGC as an oncogene (red), an 

oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or 

another type of cancer-relevant gene. Shading corresponds to −log10(p value) and 

columns represent cancer types. Red marks within cells indicate that the gene was 

annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond 

to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading 
corresponds to the fraction of samples and columns represent cancer types. Red marks 
within cells indicate that the gene was annotated by CHASMplus as a driver within a 
given TCGA cohort. 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


41 

 
Supplementary Figure 8. TCGA pan-cancer analysis reveals genes recurrently 
disrupted by variants which cause non-canonical splicing patterns 

Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows 

correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. 

Genes are clustered by whether they were annotated by the CGC as an oncogene 

(red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene 

(green), or another type of cancer-relevant gene.  Shading corresponds to −log10(p 

value) and columns represent cancer types. Red marks within cells indicate that the 

gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows 

correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


42 

Shading corresponds to the fraction of samples and columns represent cancer types. 
Red marks within cells indicate that the gene was annotated by CHASMplus as a driver 
within a given TCGA cohort. 
 
 
Supplementary Figure 9. TCGA pan-cancer analysis reveals genes recurrently 
disrupted by variants which promote splicing of particular canonical junctions 

Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows 

correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. 

Genes are clustered by whether they were annotated by the CGC as an oncogene 

(red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene 

(green), or another type of cancer-relevant gene. Shading corresponds to −log10(p 

value) and columns represent cancer types. Red marks within cells indicate that the 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


43 

gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows 

correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. 
Shading corresponds to the fraction of samples and columns represent cancer types. 
Red marks within cells indicate that the gene was annotated by CHASMplus as a driver 
within a given TCGA cohort. 
 
 
Supplementary Figure 10. Analysis of HCC, OSCC, and SCLC cohorts reveals 
genes recurrently disrupted by variants which cause non-canonical splicing 
patterns 

Results of analysis for recurrently disrupted genes in each MGI cohort. A) Rows 

correspond to the 3 most frequently recurring genes, as ranked by binomial p-value. 

Shading corresponds to −log10(p value) and columns represent cancer types. B) Rows 

correspond to the 3 most frequently recurring genes, as ranked by fraction of samples. 
Shading corresponds to the fraction of samples and columns represent cancer types.  
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


44 

 
Supplementary Figure 11. Analysis of HCC, OSCC, and SCLC cohorts reveals 
genes recurrently disrupted by variants which promote splicing of particular 
canonical junctions 

Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows 

correspond to the 4 most frequently recurring genes, as ranked by binomial p-value. 

Shading corresponds to −log10(p value) and columns represent cancer types. B) Rows 

correspond to the 4 most frequently recurring genes, as ranked by fraction of samples. 
Shading corresponds to the fraction of samples and columns represent cancer types.  
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


45 

Supplementary Figure 12: Intronic SNV in TP53 associated with alternative 
donor usage. 
 
A) IGV snapshot of a single nucleotide variant (GRCh38, chr17:g.7673609C>A) within an intron 
of TP53 in an OSCC sample. This variant is associated with an exon skipping event with 23 
reads of support and an alternate acceptor site usage with 41 reads of support. This result was 
found using the default splice variant window parameter (i2e3). B) Sashimi plot visualization of 
the novel junction. 
 

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


46 

 
Supplementary Figure 13: Intronic deletion in RNF145 associated with alternative 
donor usage. 
 
A) IGV snapshot of a single nucleotide variant (GRCh38, chr5:g.159169058delA) within an 
intron of RNF145 in COAD samples. This variant is associated with an exon skipping event with 
8 and 6 reads of support for the samples shown. This result was found using the default splice 
variant window parameter (i2e3). B) Sashimi plot visualization of the novel junction. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


47  

.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/


48 

Supplementary Figure 14: Several SNVs in CDKN2A associated with alternate 
donor usage. 
 
A) IGV snapshot of three variant positions in CDKN2A found to be associated with usage of an 
alternate donor site that leads to formation of an alternate known transcript. This result was 
found using the default splice variant window parameter (i2e3) for known (DA) junctions. B) 
Zoomed in view of the variants identified by RegTools that are associated with alternate donor 
usage. Two of these variant positions flank the donor site that is no longer being used. C) 
Sashimi plot visualizations for samples containing the identified variants that show alternate 
donor usage. 
 
 
.CC-BY-NC-ND 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under 

The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint 

https://doi.org/10.1101/436634
http://creativecommons.org/licenses/by-nc-nd/4.0/