EBSeq
- TPR relatively independent of sample size and presence of outliers.
- Poor FDR control in most situations, relatively unaffected by outliers.
- Medium computational time requirement, increases slightly with sample size.
EBSeq的输入数据是原始的read count,可以通过featureCounts、HTSeq-count等软件包获得。
安装
在服务器上安装没有报错,但是在Windows的Rstudio中安装则会出现如下报错
source("http://bioconductor.org/biocLite.R")
biocLite("EBSeq")
require(EBSeq)
载入需要的程辑包:blockmodeling
Error: package or namespace load failed for ‘blockmodeling’:
attachNamespace()里算'blockmodeling'时.onAttach失败了,详细内容:
调用: parse(con)
错误: 16:28: 意外的INCOMPLETE_STRING
15: journal = "Social Networks",
16: author = as.person("Ale
^
错误: 无法载入程辑包‘blockmodeling’
In addition: Warning message:
In parse(con) :
invalid input found on input connection 'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'
library(blockmodeling)
#出现同样的报错
修改文件'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'
的内容,其中对应报错的行有Aleš Žiberna
——带帽子的字符,删掉或者改成别的常见字符就OK了。当然如果担心以后出类似问题,改起来琐碎,可以直接设置R的识别语言。
Note:搜索时使用关键词INCOMPLETE_STRING
,而非最后的invalid input found on input connection
。搜索方向决定了解决问题的效率。
library(EBSeq)
载入需要的程辑包:gplots
载入程辑包:‘gplots’
The following object is masked from ‘package:stats’:
lowess
载入需要的程辑包:testthat
实战
#example
data(GeneMat)
head(GeneMat)
#[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#Gene_1 1879 2734 2369 2636 2188 9743 9932 10099 9829 9831
#Gene_2 24 40 22 27 31 118 108 144 117 113
Sizes=MedianNorm(GeneMat) #EBSeq requires the library size factor ls for each sample s. Here, ls may be obtained via the function MedianNorm, which reproduces the median normalization approach in DESeq
Conditions=as.factor(rep(c("C1","C2"),each=5)) #设置样本类型
Conditions
#[1] C1 C1 C1 C1 C1 C2 C2 C2 C2 C2
#Levels: C1 C2
EBOut = EBTest(GeneMat,Conditions=as.factor(rep(c("C1","C2"),each=5)),sizeFactors=Sizes, maxround=5)
EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)
str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq
实例
paste KD_In1.count KD_In2.count EGFP_In1.count EGFP_In2.count |awk '{print $1"\t"$2"\t"$4"\t"$6"\t"$8}'|sort -k 1 >combined.count
head combined.count
tail combined.count
sed -i '1,2d'combined.count #删除1-2行
sed -i '53800,$d' combined.count #删除53800-last 行
sed -i '1igene\tKD1\tKD2\tEGFP1\tEGFP2' combined.count #增加首行
data=read.table("combined.count",header=T,row.names=1)
head(data)
# EGFP1 EGFP2 KD1 KD2
#ENSMUSG00000000001 281 243 295 233
#ENSMUSG00000000003 0 0 0 0
#ENSMUSG00000000028 20 25 25 12
data=as.matrix(data)
#pre-filtering 这里仅保留了readcounts总和>=10的基因
keep <- rowSums(data[,1:4])>=10
data <- data[keep,]
#跑出结果
require(EBSeq)
Sizes=MedianNorm(data)
EBOut = EBTest(data,Conditions=as.factor(rep(c("EGFP","KD"),each=2)),sizeFactors=Sizes, maxround=15) #maxround迭代次数,一般要求最后结果趋于收敛比较可信
EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)
str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
# PPEE PPDE
#ENSMUSG00000000001 1.0000000 3.203313e-17
#ENSMUSG00000000003 NA NA
str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq
#输出DEG(FDR<0.05)
write.table(EBDERes$DEfound,file="DEG_FDR005.txt",quote=F,sep="\t",row.names=F,col.names=F)
#计算 Fold Change (FC) of the raw data as well as the posterior FC of the normalized data.
GeneFC=PostFC(EBOut)
str(GeneFC)
PlotPostVsRawFC(EBOut,GeneFC) #FC vs.Posterior FC的图
write.table(GeneFC,file="FC.txt",quote=F,sep="\t") #注意$Direction
- 如何确定maxround?
查看参数值:EBOutP, EBOut$Beta,最后两个值相差<0.01就差不多了。
关于差异表达(RNA-seq)
标准化方法不同+检验不同=多种组合/软件,用之前需要结合自己的样本量来考虑,多参考有相似实验设计的文献,常用的方法都跑一下,自己评估下结果差异,再做定夺。(研究本来就是充满了不确定性,一切都只能用“可能性”来定义,所以,采用同样参数仍然无法完全重复出文献中的结果也是常见。)
参考资料: