GO富集分析-对比Gorilla, clusterProfiler, topGO三种工具part1

实在懒得找meme了


GOrilla, clusterProfiler, topGO 分析 ranked gene list (GSEA) 对比

Genelist preparation

保证了三个工具输入的基因是一致的。

# genelist preparation
library(hgu133plus2.db)
rankedprobe <- row.names(DEGdf)
load("GSE76275_DEGs.Rdata")
## ranked entrezid by pval
rankedentrezt <- AnnotationDbi::select(hgu133plus2.db,
                                      rankedprobe,
                                      "ENTREZID",
                                      "PROBEID")
rankedentrez <- unique(na.omit(rankedentrezt$ENTREZID))

for GOrilla

纠正一个误解,之前以为 GOrilla 要求的 gene list 是按表达量从高到低排序。

This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression).

但实际上是依然是基于t检验,也就是基于p值。

All genes were ranked according to how well they differentiate between the two groups using a simple t-test. The top of the list contained the genes that were the best separators between the two groups.

GOrilla 的优势主要体现在:

  • 与基于超几何分布的工具相比——阈值灵活
  • 与其他 阈值灵活/类似GSEA/利用Kolmogorov-Smirnov检验 的工具相比:
    • 专注于GO,可直接生成DAG
    • 利用 "mHG " 可得到精确的 p.val
    • 速度快
    • (似乎放在现在这些都不是什么问题,毕竟 GSEA 2005年发布,GOrilla 2008年发布...不过一布拿到图+表也是很爽的额 (。﹏。)
## ranked symbol (increasing by pval)
rankedsymbol <- AnnotationDbi::select(hgu133plus2.db,
                                       rankedentrez,
                                       "SYMBOL",
                                       "ENTREZID")[,2]
save(rankedsymbol, file = "rankedsymbol_pvbased_forGOrilla.Rdata")
write.table(rankedsymbol, file = 
              "rankedsymbol_pvbased_forGOrilla.txt",
            quote = F, col.names = F, row.names = F)

手动选择 pvalue cutoff

for clusterProfiler

## ranked entrezid with pval (increasing by pval)
matchedprobe <- match(rankedentrez,
                      rankedentrezt$ENTREZID)
matchedprobe2 <- rankedentrezt[matchedprobe,][,"PROBEID"]
respon.pval <- DEGdf[matchedprobe2,]$P.Value
entrezlist_pval <- respon.pval
names(entrezlist_pval) <- rankedentrez
save(entrezlist_pval, file = "genelist_ENTREZID_pVal.Rdata")

### decreasing entrezid list by pval
entrezlist_dcrpv <- sort(entrezlist_pval, decreasing = T)
save(entrezlist_dcrpv, file = "genelist_ENTREZID_Decr_pVal.Rdata")

### decreasing entrezid list by FC
entrezlist_FC <- DEGdf[matchedprobe2,]$logFC
names(entrezlist_FC) <- rankedentrez
entrezlist_FC <- sort(entrezlist_FC, decreasing = T)
save(entrezlist_FC, file = "genelist_ENTREZID_FC.Rdata")

for topGO

## probeid with p.val list (increasing by pval)
probelist_pval <- respon.pval
names(probelist_pval) <- matchedprobe2
save(probelist_pval, file = "genelist_PROBEID_pVal.Rdata")

GOrilla

Running mode: Single ranked list of genes

很酷炫的一张

重复基因会被自动删除

但并不知道哪来的这12个 duplicate genes (+_+)?

table(table(rankedsymbol))
# 
#     1 
# 22012 

Enrichment (N, B, n, b) is defined as follows:
N - is the total number of genes
B - is the total number of genes associated with a specific GO term
n - is the number of genes in the top of the user's input list or in the target set when appropriate
b - is the number of genes in the intersection
Enrichment = (b/n) / (B/N)

缺点是并没有显示富集到了多少个GO term, 强行扒下这个表后,小小操作:

GOrillares <- read.csv("GOrillatable.txt", header = T, sep = "\t")
save(GOrillares, file = "GOrillatable.Rdata")

所以就是有100个 term.

nrow(GOrillares)
# [1] 100

clusterProfiler

GSEA GO-BP

标准的 GSEA.

关于输入的 gene list, y叔是这么说的:

The geneList contains three features:

  1. numeric vector: fold change or other type of numerical variable
  2. named vector: every number was named by the corresponding gene ID
  3. sorted vector: number should be sorted in decreasing order

但用 p.val 的效果似乎一般。

## clusterProfiler GSEA GO-BP
library(clusterProfiler)
library(hgu133plus2.db)
load("genelist_ENTREZID_Decr_pVal.Rdata")

### with decreasing p.val
gseaGO_BP <- gseGO(geneList     = entrezlist_dcrpv,
                   OrgDb        = hgu133plus2.db,
                   keyType      = "ENTREZID",
                   ont          = "BP",
                   nPerm        = 1000,   ## 排列数
                   minGSSize    = 5,
                   maxGSSize    = 500,
                   pvalueCutoff = 0.95,
                   verbose      = TRUE)  ## 不输出结果
gseaGO_BPresult <- gseaGO_BP@result
save(gseaGO_BPresult, file = "cp_gseaGO_BPresult.Rdata")

可能和网络也有关系?几次输入同样的参数,得到的结果并不一样....有时是报错 no term, 迷惑desu


也没什么显著的 term

还是像示例里那样用 decreasing FC

### with decreasing FC
load("genelist_ENTREZID_FC.Rdata")
gseaGO_BP2 <- gseGO(geneList     = entrezlist_FC,
                    OrgDb        = hgu133plus2.db,
                    keyType      = "ENTREZID",
                    ont          = "BP",
                    nPerm        = 1000,   ## 排列数
                    minGSSize    = 5,
                    maxGSSize    = 500,
                    pvalueCutoff = 0.05,
                    verbose      = TRUE)  ## 不输出结果
gseaGO_BPresult2 <- gseaGO_BP2@result
save(gseaGO_BPresult2, file = "cp_gseaGO_BPresult2.Rdata")

结果果然正常多了

topGO

GSEA-like GO-BP (Using the genes score )

准确地说,也许不能直接叫做 GSEA 法。topGO 说明书里的说法是:

We will use two types of test statistics: Fisher's exact test which is based on gene counts, and a Kolmogorov-Smirnov like test which computes enrichment based on gene scores.

并且:

We can use both these tests since each gene has a score (representing how di erentially expressed a gene is) and by the means of topDiffGenes functions the genes are categorized into di erentially expressed or not di erentially expressed genes.

正因为有 topDiffGenes 这个函数的存在,相当于无形中给了 target list, 而 allGenes = 相当于 universe gene list. 所以这也许是 topGO 无论用类似 ORA 或 类似 GSEA 法构建的 topGOdata ,算法和检验方法可以各种混用的原因。(maybe <@_@>

## topGO GSEA GO-BP
library(topGO)
library(hgu133plus2.db)
load("genelist_PROBEID_pVal.Rdata")

## function: topDiffGenes
topDiffGenes <- function(allScore) {
  return(allScore < 0.01)   ## p.val < 0.01
}
sum(topDiffGenes(probelist_pval))
# [1] 12210

## construct a topGOdata object
GSEAGOdata_BP <- new("topGOdata",
                     ontology = "BP",
                     allGenes = probelist_pval,
                     geneSel = topDiffGenes,
                     nodeSize = 5,
                     annot = annFUN.db,
                     affyLib = "hgu133plus2.db")
## Kolmogorov-Smirnov like test - classic
KSres <- runTest(GSEAGOdata_BP, algorithm = "weight01", statistic = "ks") 
KSres
# 
# Description:  
# Ontology: BP 
# 'weight01' algorithm with the 'ks' test
# 9106 GO terms scored: 203 terms with p < 0.01
# Annotation data:
#     Annotated genes: 16454 
#     Significant genes: 9542 
#     Min. no. of genes annotated to a GO: 5 
#     Nontrivial nodes: 9106 
GSEA_BPres <- GenTable(GSEAGOdata_BP,
                       weight01KS = KSres,
                       orderBy = "weight01KS",
                       ranksOf = "weight01KS",
                       topNodes = 203)
save(GSEA_BPres, file = "topGO_gseaGO_BPresult.Rdata")

结果对比

先找出三种工具结果中共有的 GO term:

nrow(GOrillares)  ## GOrilla
# [1] 100
nrow(gseaGO_BPresult2)  ## clusterProfiler
# [1] 457
nrow(GSEA_BPres)  ## topGO
# [1] 203
GOrillavscp <- intersect(GOrillares$GO.term, gseaGO_BPresult2$ID)
GOrillavstop <- intersect(GOrillares$GO.term, GSEA_BPres$GO.ID)
triple <- intersect(GOrillavscp, GOrillavstop)
triple
# [1] "GO:0050867"

只有一个😳

row.names(GOrillares) <- GOrillares$GO.term
row.names(GSEA_BPres) <- GSEA_BPres$GO.ID
theonly <- cbind(GOrillares[triple,],
                 gseaGO_BPresult2[triple,],
                 GSEA_BPres[triple,])
finetable <- subset(theonly, select = c("Description",
                    "Enrichment..N..B..n..b.", "P.value","FDR.q.value",
                    "setSize","enrichmentScore","pvalue", "p.adjust","qvalues", 
                    "Annotated", "Significant","weight01KS"))
finetable
#                                       Description Enrichment..N..B..n..b.
# GO:0050867 positive regulation of cell activation  35.01 (17224,328,12,8)
#             P.value FDR.q.value setSize enrichmentScore     pvalue
# GO:0050867 7.67e-11    6.28e-08     312       0.2948657 0.00203252
#              p.adjust    qvalues Annotated Significant weight01KS
# GO:0050867 0.04044123 0.03411746       313         174    0.00383

看起来, GOrilla 的结果和常用的两个R包的结果差的还是挺多的😲

出于好奇看一下 topGO 和 clusterProfiler 结果的差别

topvscp <- intersect(GSEA_BPres$GO.ID, gseaGO_BPresult2$ID)
length(topvscp)
# [1] 19

大概就是这个亚子。


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,099评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,828评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,540评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,848评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,971评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,132评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,193评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,934评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,376评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,687评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,846评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,537评论 4 335
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,175评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,887评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,134评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,674评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,741评论 2 351

推荐阅读更多精彩内容