一、获得注释包中的GO信息
library(GOSemSim)
library(org.Hs.eg.db)
hsGO_BP<-godata("org.Hs.eg.db",keytype = "ENTREZID",ont = "BP",computeIC = T)
二、下载最新的KEGG数据库
download_KEGG("mmu",keyType = "ncbi-geneid") ->res
三、将GOid转化为term
go2term("GO:0002576")
四、计算词条相似性
计算相似性的原理可以参考:https://yulab-smu.top/biomedical-knowledge-mining-book/semantic-similarity-overview.html
原文中会有词条、基因和基因集合的语义相似性分析,这里只说词条相似性
library(GOSemSim)
library(org.Hs.eg.db)
hsGO_BP<-godata("org.Hs.eg.db",keytype = "ENTREZID",ont = "BP",computeIC = T)
#随机取1000个term
set.seed(2022)
gobp_sample<-sample(hsGO_BP@geneAnno$GO%>%unique(),1000)
#-----------------------------------------------------------------------------
system.time(
{
gobp_similarity_wang<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Wang",combine = NULL)
}
)
#-----------------------------------------------------------------------------
system.time(
{
gobp_similarity_lin<-mgoSim(hsGO_BP@geneAnno$GO%>%unique(),hsGO_BP@geneAnno$GO%>%unique(),hsGO_BP,measure = "Lin",combine = NULL)
}
)
#-----------------------------------------------------------------------------
system.time(
{
gobp_similarity_rel<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Rel",combine = NULL)
}
)
#-----------------------------------------------------------------------------
system.time(
{
gobp_similarity_Resnik<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Resnik",combine = NULL)
}
)
#-----------------------------------------------------------------------------
system.time(
{
gobp_similarity_jiang<-mgoSim(gobp_sample,gobp_sample,hsGO_BP,measure = "Jiang",combine = NULL)
}
)
这些结果可以用来对GO term去冗余
#-----------------------------------------------------------------------------
#对前100做统计
gobp_similarity_wang_filter<-gobp_similarity_wang[1:100,1:100]
gobp_similarity_wang_filter[gobp_similarity_wang_filter<0.3]=0
rownames(gobp_similarity_wang_filter)<-rownames(gobp_similarity_wang_filter)%>%sapply(.,function(x){go2term(x)%>%pull(Term)})
gobp_similarity_wang_filter%>%pheatmap()
当然在这个包内本身存在一个函数simplify
专门用来去冗余,其原理为:
simplify <- function(enrichResult, cutoff=0.7, by="p.adjust", select_fun=min) {
## GO terms that have semantic similarity higher than `cutoff` are treated as redundant terms
## select one representative term by applying `select_fun` to feature specifying by `by`.
## user can defined their own `select_fun` function.
## return an updated `enrichResult` object.
}
但是这只能用来对富集分析的结果用,同时会删除那些冗余的term。而像david和metascape在进行富集分析的时候,尤其是metascape,会保留冗余的term,同时把代表性的term(p最小)定义为summary,而其它term定义为member。
五、做注释分析(为每个基因找到它们的GO term)
有点慢,建议用godata()的返回值自己检索
groupGO("Hmgcs2",org.Mm.eg.db,"SYMBOL","BP",level = 5) ->res
res@result%>%filter(Count>0)
ID Description Count GeneRatio geneID
GO:0006139 GO:0006139 nucleobase-containing compound metabolic process 1 1/1 Hmgcs2
GO:0043603 GO:0043603 cellular amide metabolic process 1 1/1 Hmgcs2
GO:0072521 GO:0072521 purine-containing compound metabolic process 1 1/1 Hmgcs2
GO:0008299 GO:0008299 isoprenoid biosynthetic process 1 1/1 Hmgcs2
GO:0008654 GO:0008654 phospholipid biosynthetic process 1 1/1 Hmgcs2
GO:0046951 GO:0046951 ketone body biosynthetic process 1 1/1 Hmgcs2
GO:0046165 GO:0046165 alcohol biosynthetic process 1 1/1 Hmgcs2
GO:0008610 GO:0008610 lipid biosynthetic process 1 1/1 Hmgcs2
GO:0090407 GO:0090407 organophosphate biosynthetic process 1 1/1 Hmgcs2
GO:1901362 GO:1901362 organic cyclic compound biosynthetic process 1 1/1 Hmgcs2
GO:1901570 GO:1901570 fatty acid derivative biosynthetic process 1 1/1 Hmgcs2
GO:1901617 GO:1901617 organic hydroxy compound biosynthetic process 1 1/1 Hmgcs2
GO:1902224 GO:1902224 ketone body metabolic process 1 1/1 Hmgcs2
GO:0035383 GO:0035383 thioester metabolic process 1 1/1 Hmgcs2
GO:0006796 GO:0006796 phosphate-containing compound metabolic process 1 1/1 Hmgcs2
GO:0019637 GO:0019637 organophosphate metabolic process 1 1/1 Hmgcs2
GO:0006644 GO:0006644 phospholipid metabolic process 1 1/1 Hmgcs2
GO:0006720 GO:0006720 isoprenoid metabolic process 1 1/1 Hmgcs2
GO:0055086 GO:0055086 nucleobase-containing small molecule metabolic process 1 1/1 Hmgcs2
GO:0008202 GO:0008202 steroid metabolic process 1 1/1 Hmgcs2
GO:0044255 GO:0044255 cellular lipid metabolic process 1 1/1 Hmgcs2
GO:1902652 GO:1902652 secondary alcohol metabolic process 1 1/1 Hmgcs2
GO:0006753 GO:0006753 nucleoside phosphate metabolic process 1 1/1 Hmgcs2
GO:0019693 GO:0019693 ribose phosphate metabolic process 1 1/1 Hmgcs2
GO:0006637 GO:0006637 acyl-CoA metabolic process 1 1/1 Hmgcs2
GO:0006066 GO:0006066 alcohol metabolic process 1 1/1 Hmgcs2
GO:0016125 GO:0016125 sterol metabolic process 1 1/1 Hmgcs2
六、对KEGG的module进行富集
相比于kegg通路,module反映了通路内局部的反应过程,代表了更准确的信息
mkk <- enrichMKEGG(gene = gene,
organism = 'hsa',
pvalueCutoff = 1,
qvalueCutoff = 1)
head(mkk)
## ID Description GeneRatio
## M00912 M00912 NAD biosynthesis, tryptophan => quinolinate => NAD 2/9
## M00095 M00095 C5 isoprenoid biosynthesis, mevalonate pathway 1/9
## M00053 M00053 Pyrimidine deoxyribonuleotide biosynthesis, CDP => dCTP 1/9
## M00938 M00938 Pyrimidine deoxyribonuleotide biosynthesis, UDP => dTTP 1/9
## M00003 M00003 Gluconeogenesis, oxaloacetate => fructose-6P 1/9
## M00049 M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP 1/9
## BgRatio pvalue p.adjust qvalue geneID Count
## M00912 12/829 0.006541756 0.03925053 0.03443029 23475/3620 2
## M00095 10/829 0.103949536 0.20710097 0.18166751 3158 1
## M00053 11/829 0.113796244 0.20710097 0.18166751 6241 1
## M00938 14/829 0.142761862 0.20710097 0.18166751 6241 1
## M00003 18/829 0.180072577 0.20710097 0.18166751 5105 1
## M00049 21/829 0.207100966 0.20710097 0.18166751 26289 1
七、对KEGG通路可视化
相当于生成一条kegg库书写规则的url,至于进一步可视化,可以使用pathwayview完成
browseKEGG(kk, 'hsa04110')