曾经我用 rMATS 进行可变剪切分析,现在我觉得 IsoformSwitchAnalyzeR 更香。
IsoformSwitchAnalyzeR 功能总结如下。
In summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches in both individual genes and on a genome wide level.
注:
- ORF: Open Reading Frame
- NMD: Non-sense Mediated Decay. 真核生物消除提前出现终止密码的错误 mRNA 机制。
- IDP: intrinsically disordered protein. 无固定三维结构的蛋白。传统上大家认为蛋白功能依赖于稳定的三维结构,但是IDP缺乏稳定结构,一些IDP与其他大分子结合后会有固定的三维结构。 总体而言,IDP在许多方面与结构蛋白不同,并且倾向于具有独特的功能,结构,序列,相互作用,进化和调控。
总而言之 IsoformSwitchAnalyzeR 不仅仅是分析可变剪切事件,还整合了蛋白结构等信息用于判定可变剪切是否影响蛋白功能,同时包含了基因组整体的分析。
本文用 RNA-seq 数据过一遍 IsoformSwitchAnalyzeR 分析流程。输入数据是 Salmon 定量结果,不知道 Salmon 如何用的同学赶紧去看 《Salmon 进行转录本定量》。
用 BiocManager 安装。
BiocManager::install("IsoformSwitchAnalyzeR")
library(IsoformSwitchAnalyzeR, quietly=TRUE)
导入 Salmon 定量数据。Salmon 每个样本结果在单独目录,这里 parentDir
是这些目录的上级目录,能包含所有结果。
> isoformExpr <- importIsoformExpression(parentDir="/ExamplePath/S
almon", pattern="quant.sf")
Step 1 of 3: Identifying which algorithm was used...
The quantification algorithm used was: Salmon
Found 8 quantification file(s) of interest
Step 2 of 3: Reading data...
reading in files with read_tsv
1 2 3 4 5 6 7 8
Step 3 of 3: Normalizing FPKM/TxPM values via edgeR...
Done
导入后会被保存到列表对象。包含每个转录本的 read counts 和丰度 TxPM 以及有效长度等信息。
> str(isoformExpr)
List of 4
$ abundance :'data.frame': 227063 obs. of 9 variables:
..$ isoform_id: chr [1:227063] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
"ENST00000619216.1" ...
..$ 3-13 : num [1:227063] 0.198 0 10.598 0 0 ...
..$ 3-16 : num [1:227063] 0.122 0 0.826 0 0 ...
..$ 3-17 : num [1:227063] 0 0 6.57 0 0 ...
..$ 3-2 : num [1:227063] 0.345 0 5.916 0 0 ...
..$ O_1 : num [1:227063] 0.173 0 8.588 0 0 ...
..$ O_2 : num [1:227063] 0.0979 0 10.6988 0 0 ...
..$ O_3 : num [1:227063] 0 0 10.9 0 0 ...
..$ O_4 : num [1:227063] 0 0 9.26 0 0 ...
$ counts :'data.frame': 227063 obs. of 9 variables:
..$ isoform_id: chr [1:227063] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
"ENST00000619216.1" ...
..$ 3-13 : num [1:227063] 4.83 0 258.58 0 0 ...
..$ 3-16 : num [1:227063] 1.11 0 7.56 0 0 ...
..$ 3-17 : num [1:227063] 0 0 72.9 0 0 ...
..$ 3-2 : num [1:227063] 5.7 0 97.7 0 0 ...
..$ O_1 : num [1:227063] 3.58 0 177.87 0 0 ...
..$ O_2 : num [1:227063] 2.4 0 261.7 0 0 ...
..$ O_3 : num [1:227063] 0 0 206 0 0 ...
..$ O_4 : num [1:227063] 0 0 158 0 0 ...
$ length :'data.frame': 227063 obs. of 9 variables:
..$ isoform_id: chr [1:227063] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
"ENST00000619216.1" ...
..$ 3-13 : num [1:227063] 1497.8 472.9 1191.8 11.1 552.8 ...
..$ 3-16 : num [1:227063] 1515.1 490.1 1209.1 10.7 570.1 ...
..$ 3-17 : num [1:227063] 1504.9 479.9 1198.9 10.1 559.9 ...
..$ 3-2 : num [1:227063] 1495.8 470.9 1189.8 10.4 550.8 ...
..$ O_1 : num [1:227063] 1501.2 476.3 1195.2 10.2 556.2 ...
..$ O_2 : num [1:227063] 1500.2 475.2 1194.2 10.3 555.2 ...
..$ O_3 : num [1:227063] 1494.8 469.8 1188.8 10.7 549.8 ...
..$ O_4 : num [1:227063] 1496.2 471.3 1190.2 10.8 551.2 ...
$ importOptions:List of 3
..$ calculateCountsFromAbundance: logi TRUE
..$ interLibNormTxPM : logi TRUE
..$ normalizationMethod : chr "TMM"
用 importRdata
函数整合所有需要的信息,包括表达数据、实验设计(Design Matrix)、注释信息等。
# 先构建 design matrix
> samples <- c("3-13", "3-16", "3-17", "3-2", "O_1", "O_2", "O_3", "O_4")
> condictions <- c(rep_len("Treatment", 4), rep_len("Control", 4))
> designM <- data.frame(sampleID=samples, condition=condictions)
> designM
sampleID condition
1 3-13 Treatment2
2 3-16 Treatment
3 3-17 Treatment4
4 3-2 Treatment
5 O_1 Control
6 O_2 Control
7 O_3 Control
8 O_4 Control
# 构建 switchAnalyzeRlist 对象
> switchList <- importRdata(isoformCountMatrix=isoformExpr$counts, isoformRepExpression=isoformExpr$abundance, designMatrix=designM, isoformExonAnnoation="/ExamplePath/GRCh38_hg38/gencode.v33.annotation.gtf", isoformNtFasta="/ExamplePath/GRCh38_hg38/gencode.v33.transcripts.fa")
Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
importing GTF (this may take a while)
Step 3 of 6: Calculating gene expression and isoform fraction...
99190 ( 43.89%) isoforms were removed since they were not expressed in any samples.
Step 4 of 6: Merging gene and isoform expression... |======================================================================| 100%Step 5 of 6: Making comparisons... |======================================================================| 100%
Step 6 of 6: Making switchAnalyzeRlist object...
Done
这里 isoformNtFasta
是每个转录本的 fasta 序列文件,在 Salmon 建立索引时也要用到。完成后返回 switchAnalyzeRlist
对象。
进行过滤,减少下游分析计算量。
> switchListF <- preFilter(switchList)
The filtering removed 65336 ( 51.53% of ) transcripts. There is now 61459 isoforms left
分析差异表达的 isoform (差异可变剪切分析)。
> switchListD <- isoformSwitchTestDEXSeq(switchAnalyzeRlist=switchListF, alpha=0.05)
Step 1 of 2: Testing each pairwise comparisons with DEXSeq (this might be a bit slow)... Estimated time (for dataset with ~30.000 isoforms): 1.7 min
Step 2 of 2: Integrating result into switchAnalyzeRlist...
Isoform switch analysis was performed for 11580 gene comparisons (100%).
Done
这里 alpha
是调整后 P 值阈值。
ORF 分析和取得序列,包括将 ORF 序列翻译到氨基酸。
> switchListO <- analyzeORF(switchListD)
Step 1 of 3 : Extracting transcript sequences...
Step 2 of 3 : Locating ORFs of interest...
|======================================================================| 100%
Step 3 of 3 : Scanning for PTCs...
683 putative ORFs were identified and analyzed
Done
> switchListS <- extractSequence(switchListO, pathToOutput="/ExamplePath/IsoformSequence")
Step 1 of 3 : Extracting transcript nucleotide sequences...
Step 1 of 3 : Extracting ORF AA sequences...
Step 2 of 3 : Preparing output...
Done
> extractSwitchSummary(switchListS)
Comparison nrIsoforms nrGenes
1 Control vs Treatment
文件将被输出到 pathToOutput
指定目录,输出的文件可以供 CPAT, Pfam, SignalP 等分析。这些软件的分析结果,接下来整合到 switchAnalyzeRlist
对象里。
注:到这里就用输出的序列文件自行做 CPAT, Pfam, SignalP 分析,取得这些分析结果后再往下继续可变剪切分析。当然你也可以不做这些分析或只做部分,我这里做了 CPAT 和 Pfam 没做 SignalP 分析。
> switchListC <- analyzeCPAT(switchListS, pathToCPATresultFile="/ExamplePath/IsoformSequence/CPAT.txt", codingCutoff=0.364, removeNoncodinORFs=FALSE)
Added coding potential to 723 (100%) transcripts
> switchListP <- analyzePFAM(switchListC, pathToPFAMresultFile="/ExamplePath/IsoformSequence/isoformSwitchAnalyzeR_isoform_AA.out.fa")
Converting AA coordinats to transcript and genomic coordinats...
|======================================================================| 100%
Added domain information to 374 (51.73%) transcripts
进行可变剪切分析。
> switchListA <- analyzeAlternativeSplicing(switchListP)
Step 1 of 3: Massaging data...
Step 2 of 3: Analyzing splicing...
|======================================================================| 100%
Step 3 of 3: Preparing output...
Done
> switchListCQ <- analyzeSwitchConsequences(switchListA, consequencesToAnalyze=c("intron_retention", "coding_potential", "ORF_seq_similarity", "NMD_status", "domains_identified"))
Step 1 of 4: Extracting genes with isoform switches...
Step 2 of 4: Analyzing 126 pairwise isoforms comparisons...
|======================================================================| 100%
Step 3 of 4: Massaging isoforms comparisons results...
Step 4 of 4: Preparing output...
Identified genes with containing isoforms switching with functional consequences...
这里 consequencesToAnalyze
根据已经整合的注释数据进行选择,我这里整合了 CPAT 结果就可以加入 coding_potential
分析。没有整合 SignalP 结果,就不能进行 signal_peptide_identified
分析。
输出可变剪切结果图, n
是数目 Inf
代表输出所有结果。
> switchPlotTopSwitches(switchListCQ, n=Inf, pathToOutput="/ExamplePath/ASEvents")
Extracting data...
Creating 99 plots...
|======================================================================| 100%
Made 99 plots of genes with isoform switching
输出所有的可变剪切事件后,可以查看自己感兴趣的基因/事件。下面是我这次结果的几个图。从图中可以查看每个 isoform 结构以及被样本使用情况以及组间是否显著差异。
下面的函数可以查看基因组总体的可变剪切分析。
> extractSwitchSummary(switchListCQ)
Comparison nrIsoforms nrGenes
1 Control vs Treatment 110 99
> pdf("/ExamplePath/ASEvents/SplicingSummary.pdf")
> print(extractSplicingSummary(switchListCQ))
> print(extractSplicingEnrichment(switchListCQ))
> print(extractSplicingGenomeWide(switchListCQ))
SplicingSummary 图示例:
第二个图画的是 prop.test
函数结果。表明 gain/loss 占比是否显著的改变。0.5表示占比没有改变。
各简写含义:
- IR : Intron Retention.
- A5 : Alternative 5’ donor site (changes in the 5’end of the upstream exon).
- A3 : Alternative 3’ acceptor site (changes in the 3’end of the downstream exon).
- ATSS : Alternative Transcription Start Site.
- ATTS : Alternative Transcription Termination Site.
- ES : Exon Skipping.
- MES : Multiple Exon Skipping. Skipping of >1 consecutive exons.
- MEE : Mutually Exclusive Exons.