GSEA涉及到很多文件格式,参考:GSEA数据格式。
在《GTEx: 如何找组织特异表达的基因》中,已经讨论了如何求组织特异表达的基因集,这里直接使用。分析内容参考文献:Tissue-specific transcription reprogramming promotes liver metastasis of colorectal cancer:
- The
log 2 ratios
were computed for several datasets with normalized expression data in different conditions; - the
preranked
Gene Set Enrichment Analysis was performed with
tissue-associated gene sets and dataset with default parameters
1. 准备基因集gmt文件
#去掉交集
Liver_specific_gene <- setdiff(Liver$Description,Colon_Sigmoid$Description)
Colon_Sigmoid_specific_gene <- setdiff(Colon_Sigmoid$Description,Liver$Description)
##转为GMT格式
df_forGMT <- data.frame(
Liver=c("Liver_specific",Liver_specific_gene),
Colon=c("Colon_specific",Colon_Sigmoid_specific_gene,rep("NANANANA",242))
)
write.table(t(df_forGMT),"Liver_Colon_specific.gmt",quote=F,sep="\t",row.names=T,col.names=F)
$ sed -i 's/\tNANANANA//g' Liver_Colon_specific.gmt
2. 在GEO上下载表达矩阵,并整理
less -S GSE50760_series_matrix.txt | grep "Sample_title" | sed 's/\t/\n/g' | grep "Sample_title" -v | sed 's/"//g' > sample.info
paste *.1_FPKM.txt *.3_FPKM.txt | awk '{for (i=2;i<=72;i=i+2) printf("%s\t",$i);printf("\n")}' > tmp1
less -S GSM1228184_AMC_2.1_FPKM.txt | cut -f1 > tmp2
paste tmp2 tmp1 > gene_sample_fpkm.matrix.txt #这个文件第1列可能有乱七八糟的名称,并且还可能重复,需要自己过滤一下
less -S gene_sample_fpkm.matrix.txt | cut -f1 |sort | uniq -c | head -n 4
1 01-Dec
2 01-Mar
1 01-Sep
2 02-Mar
3. 准备RNK文件
#准备数据
gene_list <- read.table("gene_sample_fpkm_filter.matrix.txt",header = T,row.names = 1,stringsAsFactors = F)
averaged_fpkm <- data.frame(
primary_CRC=rowSums(gene_list[,1:18])/18,
liver_met=rowSums(gene_list[,19:36])/18
)
averaged_fpkm$sum <- averaged_fpkm$liver_met+averaged_fpkm$primary_CRC
averaged_fpkm$Name <- rownames(averaged_fpkm)
##过滤一部分表达值很低的基因
averaged_fpkm <- averaged_fpkm%>%filter(sum>0.1) #0.1是我想当然的一个值
averaged_fpkm$logFC <- log2(averaged_fpkm$liver_met/averaged_fpkm$primary_CRC)
averaged_fpkm <- averaged_fpkm%>%filter(averaged_fpkm$logFC != "NaN")
averaged_fpkm <- averaged_fpkm%>%filter(abs(averaged_fpkm$logFC) != "Inf") #NaN, Inf, -Inf都不合理,需要滤掉
averaged_fpkm$rank <- rank(averaged_fpkm$logFC)
averaged_fpkm <- arrange(averaged_fpkm,desc(rank))
##保存为RNK文件
write.table(as.data.frame(averaged_fpkm[,4:5]),"averaged_fpkm.rnk",quote=F,sep="\t",row.names=F,col.names=F)
4. 使用桌面版的GSEA
下载链接:http://software.broadinstitute.org/gsea/downloads.jsp
桌面版的使用很简单,并且报错提示信息很丰富,不怕出问题。
4.1 使用prerank模式
NES: 2.61; FDR q-val: 0.000。
4.2 一般用法
需要三个文件:基因集gmt、表型文件cls、基因表达文件gct
gct
cls
gmt
需要注意获取gct文件不能用原始表达矩阵,原始表达矩阵的两种class求ratio可能得到正负无穷和NaN,需要提前过滤一下。
得到的富集图形跟上面那个一样,就不放了。NES: 1.24, FDR q-val: 0.043。
5. 使用Python版的GSEA--GSEAPY
conda install -c bioconda gseapy
PS C:\Users\15927\Desktop\tmp> gseapy prerank -r .\averaged_fpkm.rnk -g .\Liver_Colon_specific.gmt -l met pri -o prerank_report_tmp --max-size 1000
优点是图形上增加了统计值