在Linux上跑GSEA只是工程需要,在Windows上不能批量的跑;在单细胞数据上跑GSEA那是客户的需要。有需要就需要被满足,我们今天介绍一下如何提取Seurat数据做GSEA。这里的GSEA指的是GSEA官网软件,而不是fgsea, clusterProfiler等,该软件是由JAVA写的,所以应该安装合适的JAVA,为了能看懂一些常见的报错,建议学点JAVA(半年时间过去了)。
java安装与否
which java
/usr/bin/java
/usr/bin/java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
gsea 安装与否
检查gsea是否安装成功
java -Xmx512m -cp gsea-3.0.jar xtools.gsea.Gsea -h
xtools.api.param.MissingReqdParamException:
Some required parameters (3) were not specified. The parameters are:
>gmx< Gene sets database (gmx or gmt files - one or more are allowed)
>res< Expression dataset - with rows as genes and columns as samples (for instance: res, gct, pcl files)
>cls< Phenotype labels for the samples in the expression dataset (cls file)
at xtools.api.param.ToolParamSet.check(ToolParamSet.java:340)
at xtools.api.AbstractTool.init(AbstractTool.java:169)
at xtools.api.AbstractTool.init(AbstractTool.java:193)
at xtools.gsea.Gsea.<init>(Gsea.java:51)
at xtools.gsea.Gsea.main(Gsea.java:139)
选择一个通路(gene set)
在整理这份资料的时候,发现关于GSEA的大部分疑难杂症我们生信技能树都有文章,正所谓:
生信教程技能树
入门生信谁敢坑
可参见:
批量下载GSEA基因集
制作自己的gene set文件给gsea软件
提示:通路名字中尽量不要有/
,如GLYCOLYSIS / GLUCONEOGENESIS
,因为在Linux中/
意味着路径,在输出文件中有以通路名命名的文件。细胞命名也最好不要有这个符号呀。
提取数据做GSEA
SeuratRunGSEA<- function(seuo,group,test1,test2,gmt,outdir,testname){
Idents(seuo) <- group
seuo <- subset(seuo,idents=c(test1,test2))
seuo@assays$RNA@counts-> counts # 是用count还是data?
expr_data <- as.matrix(counts) # 数据量太大怎么办?可以取 pseudocell 啊,见参考。
# 整理GSEA需要的表达谱格式
write.table(rbind(c('symbols',colnames(expr_data)),
cbind(rownames(expr_data),expr_data)),
file='expr.txt',quote=F,sep='\t',col.names=F,row.names=F)
# 整理GSEA需要的分组格式
pheno<-as.character(seuo@meta.data[,group]) # 注意顺序
con<-file('pheno.cls',open='w')
write(paste(length(pheno),'2 1'),con)
write(paste('# ',test1,' ',test2,sep=''),con)
classes<-''
for (i in 1:length(pheno)){
classes<-paste(classes,pheno[i])
}
write(classes,con)
close(con)
# GSEA 命令
command <- paste('java -Xmx512m -cp gsea-3.0.jar xtools.gsea.Gsea -res expr.txt -cls pheno.cls#',test1,'_versus_',test2,' -gmx ',gmt,
' -collapse false -nperm 1000 -permute gene_set -rnd_type no_balance -scoring_scheme weighted -rpt_label ',testname,
' -metric Diff_of_Classes -sort real -order descending -include_only_symbols false -make_sets true -median false -num 100',
' -plot_top_x 20 -rnd_seed 123456 -save_rnd_lists false -set_max 10000 -set_min 5 -zip_report false -out ', outdir, ' -gui false',sep='')
system(command)
com<-file('command.log',open='w')
write(command,com) # 保存运行的命令,以便在命令行中调参
close(com)
}
照例,请出pbmc3k数据集,测试一番:
library(Seurat)
library(SeuratData)
gmt <- "c8.all.v7.2.symbols.gmt"
outdir = './'
base.name = './'
test1='B'
test2 = 'NK'
testname='seurat_annotations'
pbmc3k
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
> head(pbmc3k@meta.data)
orig.ident nCount_RNA nFeature_RNA seurat_annotations
AAACATACAACCAC pbmc3k 2419 779 Memory CD4 T
AAACATTGAGCTAC pbmc3k 4903 1352 B
AAACATTGATCAGC pbmc3k 3147 1129 Memory CD4 T
AAACCGTGCTTCCG pbmc3k 2639 960 CD14+ Mono
AAACCGTGTATGCG pbmc3k 980 521 NK
AAACGCACTGGTAC pbmc3k 2163 781 Memory CD4 T
运行:
SeuratRunGSEA(pbmc3k,testname,test1,test2,gmt,outdir,testname)
结果在seurat_annotations.Gsea.XXXXXXXXXXXXX下,导出,用浏览器打开index.html
文件即可:
剩下的就是结合生物学意义解读结果了,网上有许多的介绍,这砖不搬了。
两个参数注意一啊,我们用的都是默认的:
-create_svgs <Boolean>
Create SVG plot images along with PNGs (GZ compressed to save space as these are very large)
Default: false
Hints : true,false
-create_gcts <Boolean>
Create GCT files for the data backing the Gene Set Enrichment Heatmaps
Default: false
Hints : true,false
preRank 模式
对应细胞数较多的情况。
runGSEA_preRank<-function(seuo,group,test1,test2,gmt,testname){
set.seed(1314)
Idents(seuo) <- group
mk <- FindMarkers(seuo,ident.1 = "C")
mk$gene <- rownames(mk)
mk %>% filter(p_val_adj >0) ->mk1
mk$p_val_adj[which(mk$p_val_adj==0)] <- min(mk1$p_val_adj)
mk %>% mutate(sign = ifelse(avg_logFC>0,1,-1),Pstatistics=-1*log(p_val_adj)*sign) -> mk
preRank.matrix<- mk[,c("gene","Pstatistics")]
prerank <- paste0(testname,'_prerank.rnk')
write.table(preRank.matrix,
file=prerank,
quote=F,
sep='\t',
col.names=F,
row.names=F)
#call java gsea version
command <- paste('java -Xmx512m -cp gsea-3.0.jar xtools.gsea.GseaPreranked -gmx ', gmt, ' -norm meandiv -nperm 1000 -rnk ', prerank ,
' -scoring_scheme weighted -make_sets true -rnd_seed 123456 -set_max 500 -set_min 1 -zip_report false ',
' -out preRankResults -create_svgs true -gui false -rpt_label ',testname, sep='')
system(command)
com<-file(paste0(testname,'_command.log'),open='w')
write(command,com)
close(com)
}
Known_Issues
批量运行GSEA,命令行版本
https://ccsb.stanford.edu/content/dam/sm/ccsb/documents/education/cbio243course/2013-sweetcorderonotes.pdf
http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA
Metabolic landscape of the tumor microenvironment at single cell resolution
单细胞转录组中的pseudocell又是什么
如何实现GSEA-基因富集分析?
GSEA-基因富集分析