
Computing the Role of Alternative Splicing in Cancer


Most human genes undergo alternative splicing (AS), and dysregulation of alternative splicing contributes to tumor initiation and progression. Computational analysis of genomic and transcriptomic data enables the systematic characterization of alternative splicing and its functional role in cancer. In this review, we summarize the latest computational approaches to studying alternative splicing in cancer and the current limitations of the most popular tools in this field. Finally, we describe some of the current computational challenges in the characterization of the role of alternative splicing in cancer.


mRNA Splicing and Altered Regulation in Cancer

Pre-mRNA splicing is required for the maturation of almost all mammalian mRNAs. Alternative splicing refers to the process by which a pre-mRNA can be processed into different mature mRNA molecules in which an exon/intron could be differentially included/excluded by the choice of alternative specific splice sites (Box 1). AS enables variable transcripts from the same DNA template, and plays an extensive role in generating protein complexity [1]. It has been estimated that in humans around 95% of genes undergo AS to produce a large variety of transcripts in a cell, tissue type, and condition-specific manner [2,3], which suggests that most cellular processes are dependent on the splicing machinery. Accumulated evidence shows that aberrations in the splicing process could contribute to cancer initiation, progression, and treatment failure through switching isoform expression of key proteins involved in apoptosis, metabolism, and cell signaling [4,5]. For instance, alternate isoforms of pyruvate kinase M (PKM) and epidermal growth factor receptor (EGFR) that are frequently expressed in glioma affect metabolism and promote tumor proliferation [6,7]. The variant isoform of CD44 is well studied in many cancer types and associated with epithelial to mesenchymal transition [8,9]. In melanoma, expression of splicing isoforms of BRAF(V600E) lacking the RAS-binding domain confers resistance to RAF inhibitors [10]. Similarly, in prostate cancer, expression of the androgen-receptor isoform encoded by splice variant 7 lacking the ligand-binding domain is associated with resistance to enzalutamide and abiraterone [11]. Cancer-associated AS events can occur by two main mutation mechanisms. cis-acting somatic mutations can hinder splicing of individual introns or generate new splice sites.
For instance, distinct splice-altering mutations are found in the p53 tumor suppressor gene (TP53), introducing novel stop codons that truncate the protein [12]. Splicing can also be deregulated by trans-acting mutations in splicing regulatory proteins including serine/arginine-rich (SR) proteins, heterogeneous nuclear ribonucleoproteins (hnRNPs), and other splicing factors (Box 1). Dysfunction of such proteins may have a larger impact on splicing dysregulation and even alter the entire transcription network [13]. Recently, large-scale genomic analysis has revealed the mutational landscape of splicing-related genes in human cancers [14] and provided genetic evidence directly linking RNA splicing regulation to cancer (Box 2).


Given the high prevalence of splicing dysfunction in cancer and its pervasive effect on the transcriptome, significant computational efforts are needed, and have been invested, for the identification and quantification of AS events on a genome-wide scale. Computational analyses provide a more complete understanding of how splicing dysfunction alters splicing globally in cancer, and have become a fundamental step before downstream experimental investigation in most studies of cancer splicing. This review discusses the latest developments, possible improvements, and current challenges of computational studies that characterize the role of AS in cancer.

 

 

Computational Deciphering of Splicing Dysregulation

The increase in read depth and decrease in cost of high-throughput RNA sequencing (RNA-seq) have enabled the systematic characterization of alternative splicing in a context-dependent manner (Figure 1). The analysis of these data has been enabled by a variety of computational tools developed in the last few years [15–17]. However, the output of these tools varies significantly, sometimes with dramatic differences, leading to conflicting interpretations [16].


These computational tools mainly fall into two methodological categories: AS detection at the whole-transcript level or at the specific-event level (Figure 1). Early studies used transcriptome deconvolution to reconstruct full-length isoforms and quantify the relative expression abundances of each isoform (e.g., Cufflinks [18], DiffSplice [19], and MISO [20]). However, transcriptome reconstruction is overall a challenging problem and is especially complicated in long genes with many transcripts [21]. It is often more convenient to directly focus on each AS event given the specific exon and junction information. For this reason, most of the extensively used and validated tools today are event-based (for instance, rMATS [22], MAJIQ [23], and JuncBASE [24]). In these tools, local AS events are first identified in each sample using variable exon reads and junction reads (linking exons or cryptic intronic splice sites) between biological conditions or from a background annotation dataset. Next, a value is assigned to quantify the extent of the isoform switch at each AS event. The most commonly used measure is called percent-spliced-in (PSI), a value in the interval zero to one, which provides the fraction of mRNA reads supporting each AS event. Adjustments of the PSI estimate differ across tools, including normalization to junction and read length (rMATS), correction for GC content (MAJIQ), and correction for batch differences between samples (JUM [21]). After quantification and correction, one can identify significant alternatively spliced events using proper statistical evaluation across experimental conditions.
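To make the PSI definition above concrete, the following is a minimal Python sketch, not the implementation of any of the tools cited. It assumes that inclusion and skipping read counts are already available for one event, and that the optional effective lengths (hypothetical inputs named `inclusion_len` and `skipping_len`) are supplied by the user to mimic a length normalization of the kind described for rMATS. The function names are illustrative only.

```python
from statistics import mean


def psi(inclusion_reads, skipping_reads, inclusion_len=1.0, skipping_len=1.0):
    """Return percent-spliced-in in [0, 1], or None if the event has no reads.

    inclusion_len and skipping_len are assumed effective lengths used to
    normalize counts, because the inclusion isoform usually offers more
    mappable positions than the single skipping junction.
    """
    if inclusion_len <= 0 or skipping_len <= 0:
        raise ValueError("effective lengths must be positive")
    inc = inclusion_reads / inclusion_len
    skp = skipping_reads / skipping_len
    total = inc + skp
    if total == 0:
        return None  # event not covered in this sample
    return inc / total


def delta_psi(psi_condition_a, psi_condition_b):
    """Difference in mean PSI between two groups of replicates (a minus b)."""
    a = [p for p in psi_condition_a if p is not None]
    b = [p for p in psi_condition_b if p is not None]
    if not a or not b:
        return None  # one group has no usable replicate
    return mean(a) - mean(b)


if __name__ == "__main__":
    tumor = [psi(80, 20), psi(60, 40)]
    normal = [psi(20, 80), psi(40, 60)]
    print(delta_psi(tumor, normal))
```

Returning `None` for uncovered events, rather than zero, keeps samples without evidence from being mistaken for samples with complete exon skipping. A real analysis would replace the simple mean difference with the statistical model of the chosen tool.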


While many tools limit their detection power to currently well-annotated references, detecting unannotated events with novel splice sites requires different strategies. For instance, SF3B1 hotspot mutations induce novel upstream 3′ splice sites (3′ss), many of which are not reported in the latest annotation of functional isoforms [25]. One conventional solution is to enlarge the input reference by generating a dataset-specific .gtf file, using tools (e.g., Cufflinks) to conduct de novo isoform reconstruction. Most of the leading tools have been updated in recent years to include novel AS detection, which is more computationally intensive and often requires additional experimental validation. These computational tools mainly report five common patterns of AS: skipping or inclusion of a cassette exon, alternative 5′ss or 3′ss choice, intron retention, and mutually exclusive exons (Figure 1), although certain complex or mixed patterns of AS can occur [21]. Many of these tools (e.g., rMATS, MISO, and JuncBASE) preferentially report exon-inclusion/skipping events, which are the most frequent AS pattern in animals. However, intron-related AS events have drawn increasing attention for their role in understanding tumorigenesis [26] and treatment design [27]. Identifying true-positive intron retention events is a difficult task, as it requires manual review of putative events in the Integrative Genomics Viewer (IGV), because apparent intronic signal can arise from the repetitive nature of intronic sequences (inaccurate read mapping) or from unannotated small/noncoding transcripts on the antisense strand. Recently, an annotation-free tool, JUM, has been specifically designed for quantifying intron retention by requiring an approximately uniform distribution of reads across the entire intronic region to reduce false-positive calls.
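The idea of requiring roughly uniform read coverage across an intron can be illustrated with a short Python sketch. This is not JUM's algorithm; it is a simplified heuristic under stated assumptions: the input is a list of per-base read depths across one intron, and the three thresholds (`min_mean_depth`, `min_covered_fraction`, `max_cv`) are hypothetical defaults chosen only for illustration.

```python
from statistics import mean, pstdev


def looks_like_retained_intron(coverage, min_mean_depth=5.0,
                               min_covered_fraction=0.9, max_cv=1.0):
    """Heuristic check that reads span an intron roughly uniformly.

    coverage: per-base read depth across the intron (assumed precomputed).
    A localized spike (e.g., a small embedded transcript or a repeat that
    attracts mis-mapped reads) fails the covered-fraction or CV test.
    """
    if not coverage:
        return False
    avg = mean(coverage)
    if avg <= 0 or avg < min_mean_depth:
        return False  # too little signal to call retention
    covered = sum(1 for depth in coverage if depth > 0) / len(coverage)
    if covered < min_covered_fraction:
        return False  # reads pile up in only part of the intron
    cv = pstdev(coverage) / avg  # coefficient of variation of depth
    return cv <= max_cv


if __name__ == "__main__":
    print(looks_like_retained_intron([10] * 100))             # even coverage
    print(looks_like_retained_intron([0] * 90 + [100] * 10))  # local spike
```

Both example introns have the same mean depth, so a filter based on read counts alone would not separate them; the shape of the coverage does.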


Some studies are not designed with distinct conditions affecting splicing; for instance, those investigating any potential effects of splicing in a specific tumor cohort without prior knowledge of any splicing changes. These studies proceed first by the description and characterization of all AS events and then by the identification of potential regulators of these AS events. The most straightforward way is to directly correlate the inclusion level of each AS event with the status of different RNA-binding proteins (RBPs), such as genetic alterations or transcriptomic expression. This approach was used in a trans-splicing quantitative trait loci (sQTL) analysis that linked somatic single nucleotide variant (SNV) positions with alternative splicing changes in 8255 samples [28]. In another example, all known binding motifs of each RBP were screened for a significant enrichment for matching nucleotide sequences in alternative splicing regions [29]. Such analyses are limited, because not all key splicing-related proteins directly bind to RNA, and not all RBPs have high-confidence motifs. A systematic evaluation of differential splicing tools applied to four datasets, using PCR-validated splicing events as the ground truth, found that MAJIQ and rMATS outperformed other tools overall [16]. However, it is still highly recommended to use more than one tool due to the relatively large variability of the results reported by the different approaches [16]. Besides direct employment of publicly available tools, custom-designed or modified algorithms that incorporate prior knowledge of the context-dependent scientific question are likely to achieve higher sensitivity and specificity when applied to specific datasets. Notably, the computational workflow described above is built on second-generation short-read RNA-seq technology. Inherent limitations of short-read sequencing have an impact on AS detection, such as a low unique-mapping rate, especially at complex loci.
However, short-read RNA-seq remains the standard and most widely used method in cancer splicing analysis, not only because of the extensive computational efforts (as summarized earlier) but also because of the low cost of producing high-throughput reads. A few studies have estimated the effect of sequencing depth and read length of short-read RNA-seq on splicing analysis [16,30,31]. Overall, these studies suggest a minimum of 50 million reads per sample and a read length of 100 bp as a baseline for accurate splicing quantification. The increasing use of long-read Nanopore or PacBio sequencing (see Glossary) has provided improved reconstruction of the full spectrum of isoform profiles and solutions to many of the drawbacks of using short reads in splicing analysis (for instance, identification of full-length transcripts with retained introns). Growing interest and demand have accelerated the development of computational tools for long-read sequencing over the past decade (archived at https://long-read-tools.org/) [32]. Some of these tools, for instance, Iso-Con [33], SQANTI [34], and FLAIR [35], enable the full-length detection of alternatively spliced transcripts. Typically, key steps of such detection pipelines include read error correction, subgroup clustering, read collapsing, and isoform annotation. Currently, AS analysis using long-read techniques is still at an early stage, and continuous efforts are needed to reduce the high false-positive rate of detected isoforms. High-quality isoform annotation tools and databases are also required to keep pace with novel transcript identification. Meanwhile, accurate quantification of isoform expression is still challenging, due to the relatively low read counts and sequencing coverage biases [32]. Short-read techniques provide an excellent option to address these limitations, because they have a larger throughput and lower error rates, and are widely used for many other analyses beyond splicing.
Future best practices may involve coupled analysis using both techniques [36].



Computational Refinement of Cancer-Associated Aberrant Splicing

The next challenge after a successful characterization of AS events is the functional interpretation of their effects; that is, how specific events may contribute to the diverse phenotypes expected in cancer cells (Figure 1). The goal of this part of the workflow is to determine which AS events, out of the full list of identified events, are functionally relevant to cancer. The first step is to extract significant changes between conditions, focusing on recurrent and robust/reproducible AS changes. One may apply different thresholds to the output of the computational splicing analysis, including thresholds on the statistical q value, the absolute change in PSI, and the median read count across replicates. For instance, when identifying cryptic 3′ss induced by SF3B1 mutations, a minimum PSI change of 0.2 is recommended (low-abundance isoforms that confer gain-of-function or dominant-negative effects do exist, but are rare). Ideally, one would expect no cryptic reads (PSI = 0) in wild-type samples, under the assumption that an AS event acts as a perfect switch that turns the carcinogenic 3′ss selection on or off. However, after investigating the splicing patterns in more than 10,000 TCGA (The Cancer Genome Atlas) samples, one recent study found that this assumption does not reflect the biological reality. Widespread occurrence of weak cryptic 3′ss usage by many well-known targets of mutant SF3B1 (e.g., MAP3K7 and PPP2R5A) was detected in samples without SF3B1 lesions and even in normal cells [37]. This result indicates that these cryptic 3′ss are inherently active and very faintly present in normal conditions, but are dramatically elevated in cases with SF3B1 hotspot mutations. Another way to determine relevant AS events is by overlapping the identified events from different biological systems including patient data, CRISPR-based cell lines, and transgenic mouse models [38,39].
Although animal models are becoming the top choice for mechanistic studies, genetic engineering usually requires an extensive amount of experimental effort and time, and the consistency of the splicing pattern between different species must be confirmed before a comparison can be made. Functional AS events usually cause expression changes at the gene or protein level. Most instances of intron retention or poison exon inclusion introduce premature stop codons upstream of the normal stop codon. The result is either nonsense-mediated decay (NMD) of the mRNA or production of a truncated protein. Thus, significant alternative splicing changes are expected to alter the expression of target genes. This information could be integrated into the identification of a shorter list of functional AS events (due to the poor overlap between AS targets and differentially expressed genes). It is also expected that dysfunction of trans-acting splicing factors may alter the global regulatory network as a result of aberrant splicing events in key genes. Recent work [39] showed the impact of mutant SF3B1 on gene-regulatory networks by elucidating the effect of SF3B1 mutations on post-translational regulation of multiple proteins with well-established roles in tumorigenesis. Besides regulatory network analysis, previously curated cancer-associated gene sets can also be used to inform the functional effect of splicing events. A routine practice is to directly pool top-ranked AS target genes into functional enrichment analysis. Typically, the top of the output list contains terms such as ‘mRNA splicing’, ‘mRNA processing’, and ‘translation’. However, few disease-relevant terms reach significant q values, because sometimes only one or two key splicing perturbations are enough to change the activity of a particular pathway.
Thus, how to effectively use pathway-based information in identifying cancer-associated AS events needs to be better defined. A recent study developed a pathway enrichment-guided study of AS by correlating transcriptional signatures of cancer driver pathways with the identified AS events and established a role for MYC in regulating RNA splicing by controlling the incorporation of NMD-determinant exons in genes encoding RBPs [40]. MYC is frequently altered in cancer cells and has long been recognized to have a genetic dependency on the splicing machinery [6,41]. Targeting the spliceosome is a therapeutic vulnerability in MYC-driven cancers [42]. One recent study found that besides being a splicing regulator, MYC is also regulated by splicing errors in SF3B1-mutant cells [39].


In summary, identification of cancer-associated mis-splicing effects involves rigorous quality control of the raw AS calls to filter technical artifacts, cross validation using independent datasets or biological systems, and integration of alternative transcriptomic information, such as changes in regulatory network activity and dysregulated signaling pathways (Figure 1).
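The threshold-based quality control of raw AS calls discussed in this section can be sketched as follows. This is an illustrative Python example, not the output format or filtering logic of any specific tool. The `ASEvent` record and its field names are hypothetical. The minimum absolute PSI change of 0.2 follows the recommendation quoted above for SF3B1-induced cryptic 3′ss, whereas the q-value cutoff of 0.05 and the median read count of 20 are assumed placeholders that should be tuned per dataset.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ASEvent:
    """Hypothetical record for one differential splicing call."""
    event_id: str
    q_value: float      # multiple-testing-adjusted significance
    delta_psi: float    # PSI difference between conditions
    read_counts: list   # supporting reads per replicate


def filter_events(events, max_q=0.05, min_abs_delta_psi=0.2,
                  min_median_reads=20):
    """Keep events passing all three thresholds, largest |delta PSI| first."""
    kept = []
    for ev in events:
        if not ev.read_counts:
            continue  # no coverage information, cannot assess support
        if (ev.q_value <= max_q
                and abs(ev.delta_psi) >= min_abs_delta_psi
                and median(ev.read_counts) >= min_median_reads):
            kept.append(ev)
    return sorted(kept, key=lambda ev: abs(ev.delta_psi), reverse=True)


if __name__ == "__main__":
    calls = [
        ASEvent("e1", 0.01, 0.35, [40, 55, 60]),
        ASEvent("e2", 0.20, 0.50, [80, 90, 70]),    # fails q value
        ASEvent("e3", 0.001, -0.45, [30, 25, 22]),
        ASEvent("e4", 0.01, 0.30, [5, 8, 3]),       # fails read support
        ASEvent("e5", 0.01, 0.05, [100, 100, 100])  # fails effect size
    ]
    for ev in filter_events(calls):
        print(ev.event_id, ev.delta_psi)
```

As noted above, a strict expectation of PSI = 0 in wild-type samples is not biologically realistic, so the sketch filters on the size of the change rather than on absence of the event in controls.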



Computational Challenges in Cancer Splicing

In the next four subsections, we discuss some interesting and challenging topics in cancer splicing that can be addressed through computational approaches (Figure 2A–D).

Pancancer Splicing Analysis

In 2012, TCGA launched a pancancer analysis project to compare and examine the similarities and differences between the genomic and cellular alterations across 12 tumor types [43,44]. Investigating AS in a pancancer cohort is a standard computationally driven task in disclosing commonly shared and lineage-independent splicing landscapes (Figure 2A). One analysis characterized AS across 32 TCGA cancer types from 8705 patients and identified increased neojunctions in tumors versus normal tissue, and trans-acting variants associated with AS events [28]. Another pancancer study reported a high frequency of common somatic alterations in splicing factor genes, suggesting that altered splicing may represent an underappreciated hallmark of tumorigenesis [14]. However, many fundamental questions still need to be better elucidated (see Outstanding Questions). For instance, given the low overlap of splicing defects and mutually exclusive pattern of key spliceosomal mutations, what are the convergent effects of such mutations in a single tumor type [e.g., myelodysplastic syndrome (MDS)-refractory anemia with ringed sideroblasts (RARS)] or across distinct histological cancer types? Recurrent spliceosomal mutations only happen in some specific tumor types, and are rare in others. So, is there a common process across the diverse tumor types with frequent mutations in splicing factors? Given the high number of proteins and genes involved in the splicing process, why are only a small subset of splicing factors (SF3B1, SRSF2, U2AF1, and ZRSR2) found recurrently mutated in cancer?
Some genes, such as SF3B1, show different hotspot mutations in different cancers; for instance, the K700 residue is frequently mutated in chronic lymphocytic leukemia (CLL), but the R625 residue is frequently mutated in uveal melanoma (UVM). Why are there cell-type-specific mutations, and what are their functions? One benefit of pancancer analyses is that they can increase the statistical power to identify rare mutations associated with specific splicing effects. For instance, one recent study utilized an unbiased pancancer analysis to identify mutations in another spliceosomal gene, SUGP1, that recapitulate the usage of cryptic 3′ss known to be found in mutant-SF3B1-expressing cells [37]. This is consistent with previous biochemical studies indicating that the loss of SF3B1 interaction with SUGP1 mimics the effects of SF3B1 mutations on splicing [45]. This work on SUGP1 was supported by a recent study from another research group [46]. Such computational strategies could be applied to many other recurrent spliceosomal mutations in human cancer.

Deep Learning-Based Splicing Analysis

By taking advantage of an ever-increasing amount of available genomics data, deep-learning techniques have been proposed to enhance the characterization of molecular alterations, improving the state-of-the-art performance for many genomics tasks, including AS analysis [47]. Improvements are quickly coming from new input data and better refinement of the biological questions, making these models increasingly accurate (Figure 2C). Recent work in this area includes deep neural network studies of alternative splicing using cis-sequence information [48,49]. The mRNA expression levels of trans-RBPs have also been incorporated as useful features to achieve a better characterization of AS in low-expression target genes or when analyzing RNA-seq data with modest coverage [50]. In this study, the RBP expression profiles were obtained from knock-down experiments by the ENCODE consortium.
However, most recurrent spliceosomal mutations in cancer result in change rather than loss of function. Therefore, such models could be better fitted to specific cancers by enlarging the training set with available datasets carrying change-of-function mutations in splicing factors. Most deep-learning applications rely on black-box frameworks and involve multiple layers of nonlinear combinations of raw inputs. This, as in other deep-learning applications, hinders interpretability, with little or no information provided on the splicing machinery alterations associated with changes in AS. The development of interpretable deep networks will be paramount in the discovery of causal links between actions and effects in cancer splicing [51].

Modeling the Effect of Epigenetic Features on AS

It has been widely accepted that epigenetic modifications regulate AS either by influencing the transcription elongation rate of RNA polymerase II (Figure 2B) or through direct interactions with proteins that mark the exon–intron junctions of pre-mRNA [52]. Genome-wide mapping has revealed enrichment of histone modifications (for instance, H3K36me3) on exons relative to introns, which have been implicated in the regulation of alternative splicing [53] (Figure 2B). Spliceosomal proteins can likewise influence chromatin structure and histone modifications, which implies a complex feedback loop of regulation [54]. One recent study identified frequent overlap of mutations in IDH2 and SRSF2 in human acute myeloid leukemia that together promote aberrant splicing and increased DNA methylation and reduced expression of INTS3, which contributes to leukemogenesis [55]. Besides modification of DNA, RNA modifications have also been found to regulate AS. For example, perturbation of the dynamic status of the N6-methyladenosine (m6A) modification could affect interaction with SR proteins that may be involved in modulating AS [56].
However, integrating genome-wide epigenetic data with AS modeling to obtain a regulatory landscape linking epigenetics, splicing, and cancer remains a computational challenge. Selecting appropriate datasets and methodologies (for instance, the deep-learning-based methods discussed earlier) will provide a means to model the effects of epigenetics on splicing [57].

Calculating AS-Derived Neoantigens

Finally, after accurate depiction and understanding of AS in cancer biology, the last important task is to develop pharmacological modulation of splicing as a therapeutic strategy (Figure 2D). Direct disruption of splicing efficiency increased sensitivity of cancer cells with spliceosomal mutations in vivo; however, some patients unfortunately exhibited unexpected side effects [58]. Immunotherapies have improved objective responses in many tumors with a high burden of protein changes [59]. T cell recognition of cancers relies upon presentation by MHC molecules of tumor-specific antigens generated by nonsynonymous mutations [60] (Figure 2D). A recent study suggested that tumor-specific AS events are far more abundant than somatic SNVs [28]. A recent publication presented a computational approach to identifying neoepitopes derived from intron retention events in tumor transcriptomes, whose presentation on MHC class I was confirmed by mass spectrometry [27]. After the identification of high-confidence AS events, genome annotations were used to extract intronic nucleotide sequences and open reading frame orientation, sample-specific HLA alleles were inferred (e.g., with POLYSOLVER [61]), and putative peptide–MHC I binding affinities were examined. It is necessary to expand such analysis to all types of AS events (besides intron retention). Such methods will be of particular interest in tumors with functional splicing changes (MDS, UVM, etc.).
These observations also suggest a potential approach to activate the host antitumor immune response by coupling spliceosome inhibition (to increase the immunogenicity) with immunotherapies. Nonetheless, the immunogenicity of such splicing-derived neoantigens will need rigorous experimental validation.
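The first computational step of the intron-retention neoepitope workflow described above, translating through a retained intron and enumerating candidate peptides, can be sketched in Python. This is a simplified illustration and not the published pipeline: it assumes the upstream coding sequence is supplied in frame (starting at the first base of a codon), uses the standard genetic code, and treats any k-mer absent from a single reference protein string as a candidate. HLA typing and peptide–MHC I binding prediction are deliberately left out. The function names and the default peptide length of 9 are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate in frame 0 until a stop codon or an ambiguous codon."""
    seq = nucleotides.upper().replace("U", "T")
    peptide = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3])
        if aa is None or aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def candidate_peptides(upstream_cds, intron_seq, reference_protein, k=9):
    """Return sorted k-mers from the read-through product that are absent
    from the reference protein (assumed to be the canonical isoform)."""
    read_through = translate(upstream_cds + intron_seq)
    kmers = {read_through[i:i + k]
             for i in range(len(read_through) - k + 1)}
    return sorted(p for p in kmers if p not in reference_protein)


if __name__ == "__main__":
    exon = "ATGGCCGGA"           # toy in-frame coding sequence: M A G
    intron = "TGGACAAAATAAGGG"   # toy intron with an in-frame stop codon
    print(candidate_peptides(exon, intron, "MAGLLL", k=3))
```

In practice the candidates would next be scored for binding to the sample-specific HLA alleles, and the comparison would be made against the whole reference proteome rather than one protein.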


Concluding Remarks

RNA splicing is a critical mediator of gene expression and regulator of proteome diversity. Alterations in splicing, including common change-of-function mutations in spliceosome genes, have been suggested to promote tumorigenesis. A number of computational methods have been developed and proven to play an important role in systematically identifying high-confidence AS events in a context-dependent manner. However, more efforts will be required to customize downstream computational analysis to decode the mechanistic consequences of splicing alterations in cancer pathogenesis. We suggest that coupled analysis of the impact of splicing dysfunction on the activity of gene-regulatory networks or cancer signaling pathways may help in the discovery of key functional events and guide further experimental studies. In this review, we have highlighted several research directions in cancer splicing related to or driven by computational analysis. First, we underscored that advances in cancer genomics projects (e.g., TCGA) have enabled high-resolution detection and comparison of AS across a wide range of tissues in a pancancer manner. Second, we suggested the importance of incorporating epigenetic features into AS analyses. The wide availability of data is enabling the development and application of in silico artificial intelligence approaches to increase the sensitivity and specificity of AS detection. Lastly, we suggested that the characterization of potential splicing-derived neoantigens may be leveraged with recent advances in immunotherapy to open new therapeutic avenues for AS-related tumors.
