15. GO and KEGG Analysis

GO and KEGG analysis of a non-model organism: barley

Analysis strategy
1. First check whether AnnotationHub (https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) already provides annotation for your organism (a quick query sketch follows this list);
2. If it is not available in AnnotationHub, build the annotation yourself with AnnotationForge (https://bioconductor.org/packages/release/bioc/html/AnnotationForge.html).
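A minimal sketch of step 1, assuming AnnotationHub installs in your R version: query the hub for a barley (Hordeum vulgare) OrgDb; an empty result means the annotation has to be built yourself (step 2).

library(AnnotationHub)
hub <- AnnotationHub()                            # connects to the hub and downloads its metadata cache
query(hub, c("Hordeum vulgare", "OrgDb"))         # search for a barley OrgDb record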

Reference:
"Sequenced, now what? (Part 4): you need an OrgDb before you can do enrichment analysis" https://www.jianshu.com/p/9c9e97167377

Part 1: Working with AnnotationHub

Install the required packages

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("AnnotationHub")
My R version was 3.6 while the package requires R 4.0, so it could not be downloaded through BiocManager; the only option was to download the tar.gz archive and install it manually.
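A manual source install from the downloaded archive would look roughly like this (a hedged sketch; the file name is a placeholder for whichever version was downloaded):

install.packages("AnnotationHub_x.y.z.tar.gz", repos = NULL, type = "source")   # placeholder file name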
This failed as well, so I tried a different approach.

Part 2: Building the annotation yourself with AnnotationForge

Install eggNOG-mapper

Working on the server:
conda install eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - eggnog-mapper

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.


[root@175 u2]# anaconda search -t conda eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Packages:
     Name                      |  Version | Package Types   | Platforms       | Builds
     ------------------------- |   ------ | --------------- | --------------- | ----------
     bioconda/eggnog-mapper    |    2.0.1 | conda           | linux-64, noarch, osx-64 | py27_2, py27_1, py27_0, py_0, py_1, py_3
                                          : Fast genome-wide functional annotation through orthology assignment.
Found 1 packages

Run 'anaconda show <USER/PACKAGE>' to get installation details
[root@175 u2]# anaconda show  bioconda/eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Name:    eggnog-mapper
Summary: Fast genome-wide functional annotation through orthology assignment.
Access:  public
Package Types:  conda
Versions:
   + 1.0.0
   + 1.0.1
   + 1.0.2
   + 1.0.3
   + 2.0.0
   + 2.0.1

To install this package with conda run:
     conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
[root@175 u2]# conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.                                                                                                         failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - eggnog-mapper -> python[version='2.7.*|<3|>=2.7,<2.8.0a0']

Your python: python=3

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The failure is because eggnog-mapper requires Python 2.7 while this environment runs Python 3; instead, download the release tarball with wget:

mkdir software/eggnog-mapper
cd software/eggnog-mapper
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3

Set up the environment (see the reference links below for details).


References:
http://www.chenlianfu.com/?p=2804
https://www.jianshu.com/p/7a3b714e6ccf
https://www.jianshu.com/p/e0abf65ca1c5
https://www.jianshu.com/p/5d12f4d8c052



## 2020.8.7 Continuing the GO analysis
Looking at my own barley data, I found that a single gene maps to many GO IDs, so there is no need to follow ...


awk -F'\t'  '{print $1,$2,$6,$7,$12, $13}' diamond.emapper.annotations > barley.emapper.annotations.txt
### load the pipe and data wrangling packages
install.packages('magrittr')
library(magrittr)
 library(stringr)
 library(dplyr)

setwd("")  # set this to the directory containing the emapper output
egg_f <- "diamond.emapper.annotations"
egg <- read.csv(egg_f, sep = "\t")
egg[egg==""]<-NA # tip from Huahua: turn empty fields into NA so they can be removed with na.omit() below
gene_info <- egg %>%
        dplyr::select(GID = query_name, GENENAME = `eggNOG.annot`) %>% na.omit()
# First build an empty data frame (the skeleton: GID = query_name, GO = GO ID, EVIDENCE = "IEA" by default)
# About IEA: it is one of many GO evidence codes; IEA means the annotation was assigned automatically, without manual review: http://wiki.geneontology.org/index.php/Inferred_from_Electronic_Annotation_(IEA)
# IEA is used in two situations: 1. manually constructed mappings between external classification systems and GO terms; 2. automatic transfer of annotation to orthologous gene products.
gene2go <- data.frame(GID = character(),
                         GO = character(),
                         EVIDENCE = character())

# Then fill it in. Some query_names map to several GO IDs, so we reshape to one GO ID per row; query_name and EVIDENCE may repeat.
# gterms has to be built first; here I assume the GO column of the emapper output is called GOs (the column name varies between emapper versions):
gterms <- egg %>%
        dplyr::select(query_name, GO_terms = GOs) %>% na.omit()
for (row in 1:nrow(gterms)) {
        gene_terms <- str_split(gterms[row, "GO_terms"], ",", simplify = FALSE)[[1]]
        gene_id <- gterms[row, "query_name"][[1]]
        tmp <- data.frame(GID = rep(gene_id, length(gene_terms)),
                          GO = gene_terms,
                          EVIDENCE = rep("IEA", length(gene_terms)))
        gene2go <- rbind(gene2go, tmp)
}
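The same one-GO-per-row table can also be built without the explicit loop, which is much faster on a large annotation file. A minimal sketch, again assuming the GO column of the emapper output is named GOs:

library(tidyr)
gene2go <- egg %>%
        dplyr::select(GID = query_name, GO = GOs) %>%      # adjust GOs to the column name in your emapper output
        na.omit() %>%
        tidyr::separate_rows(GO, sep = ",") %>%            # split comma-separated GO IDs into one row each
        dplyr::mutate(EVIDENCE = "IEA")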






options(stringsAsFactors = F)
if(F){
    # the json file needs to be downloaded first (it is updated regularly)
    # https://www.genome.jp/kegg-bin/get_htext?ko00001
    # code adapted from: http://www.genek.tv/course/225/task/4861/show
    library(jsonlite)
    library(purrr)
    library(RCurl)
    
    update_kegg <- function(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json") {
        pathway2name <- tibble(Pathway = character(), Name = character())
        ko2pathway <- tibble(Ko = character(), Pathway = character())
        
        kegg <- fromJSON(json)
        
        for (a in seq_along(kegg[["children"]][["children"]])) {
            A <- kegg[["children"]][["name"]][[a]]
            
            for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
                B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
                
                for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
                    pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
                    
                    pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
                    pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
                    pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
                    
                    kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
                    
                    kos <- str_match(kos_info, "K[0-9]*")[,1]
                    
                    ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
                }
            }
        }
        
        save(pathway2name, ko2pathway, file = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/kegg_info.RData")
    }
    
    update_kegg(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json")
    
}

options(stringsAsFactors = F)
if(F){
    # second attempt: read the json with rjson instead of jsonlite
    # the json file needs to be downloaded first (it is updated regularly)
    # https://www.genome.jp/kegg-bin/get_htext?ko00001
    # code adapted from: http://www.genek.tv/course/225/task/4861/show
    library(jsonlite)
    library(purrr)
    library(RCurl)
    library("rjson")
    
    update_kegg <- function(json = "ko00001.json") {
        pathway2name <- tibble(Pathway = character(), Name = character())
        ko2pathway <- tibble(Ko = character(), Pathway = character())
        
        kegg <- fromJSON(file="ko00001.json")
        
        for (a in seq_along(kegg[["children"]][["children"]])) {
            A <- kegg[["children"]][["name"]][[a]]
            
            for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
                B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
                
                for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
                    pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
                    
                    pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
                    pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
                    pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
                    
                    kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
                    
                    kos <- str_match(kos_info, "K[0-9]*")[,1]
                    
                    ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
                }
            }
        }
        
        save(pathway2name, ko2pathway, file = "kegg_info.RData")
    }
    
    update_kegg(json = "ko00001.json")
    
}

This run did not produce the kegg_info.RData file, so the next step (load()) had nothing to load, and I had to try something else.
The required format is one gene per GO ID, so my data first had to be reshaped into that layout.
The problem here is that in for (a in seq_along(kegg[["children"]][["children"]])) the loop variable a never takes a value, so nothing gets extracted; most likely this is because rjson::fromJSON returns plain nested lists, whereas the indexing in the loop assumes the data-frame structure that jsonlite::fromJSON produces. I could not work out what the code meant, the author never answered my email, and when I asked his girlfriend on WeChat she said he wrote it, she didn't know, and I should ask him. So I fell back on Y叔 (Guangchuang Yu)'s excellent clusterProfiler package. Probably because I am not yet fluent with it, I could only get the GO analysis to work: I could obtain the GO descriptions, but there was no mapping from the KEGG KO IDs to their descriptions. I therefore looked for another way to get the description for each KO ID.
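A quick way to check which structure your fromJSON call actually returned (a hedged debugging sketch; with jsonlite the top level should contain a nested data frame called children, while rjson gives plain nested lists, which is what breaks the loop above):

library(jsonlite)
kegg <- jsonlite::fromJSON("ko00001.json")
str(kegg, max.level = 2)                         # with jsonlite, 'children' is a data frame with columns name and children
length(kegg[["children"]][["children"]])         # if this is 0, seq_along() is empty and the loop body never runs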

I then found a Jianshu author who used the same method and did obtain kegg_info.RData; after following his WeChat public account I asked him for the kegg_info.RData file through the account's messages.


Part 3: Obtaining the KEGG annotation

Reference: https://www.jianshu.com/p/6f9e6ed3400d (organizing KEGG pathway annotation)
With the eggnog-mapper and interproscan tools (or databases) you can obtain KEGG ORTHOLOGY (KO) annotations, i.e. the K number corresponding to each gene or transcript; see each tool's wiki for details.
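A hedged R sketch of turning the emapper output into a gene-to-KO table, one K number per row (this is one possible way to produce a file like the barley_gene2ko.csv used later in this post; the KO column name, here assumed to be KEGG_KOs, varies between emapper versions):

library(dplyr)
library(tidyr)
egg <- read.csv("diamond.emapper.annotations", sep = "\t")
gene2ko <- egg %>%
        dplyr::select(GID = query_name, Ko = KEGG_KOs) %>%   # adjust KEGG_KOs to your emapper version's column name
        dplyr::filter(Ko != "") %>%
        tidyr::separate_rows(Ko, sep = ",")                  # one K number per row
write.csv(gene2ko, "barley_gene2ko.csv", row.names = FALSE)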

Obtaining the KO-to-pathway mapping

Go to the KEGG website and open the KEGG BRITE database, which provides KEGG's manually curated hierarchy files (BRITE hierarchy files) for download. Here we need the file containing the pathway-to-KO correspondence: click KEGG Orthology (KO) and download the json version.

import json
import re

with open("ko00001.json") as f:
    ko_map_data = json.load(f)

with open("KEGG_pathway_ko.txt", "w") as oh:
    line = "level1_pathway_id\tlevel1_pathway_name\tlevel2_pathway_id\tlevel2_pathway_name"
    line += "\tlevel3_pathway_id\tlevel3_pathway_name\tko\tko_name\tko_des\tec\n"
    oh.write(line)
    for level1 in ko_map_data["children"]:
        m = re.match(r"(\S+)\s+([\S\w\s]+)", level1["name"])
        level1_pathway_id = m.groups()[0].strip()
        level1_pathway_name = m.groups()[1].strip()
        for level2 in level1["children"]:
            m = re.match(r"(\S+)\s+([\S\w\s]+)", level2["name"])
            level2_pathway_id = m.groups()[0].strip()
            level2_pathway_name = m.groups()[1].strip()
            for level3 in level2["children"]:
                m = re.match(r"(\S+)\s+([^\[]*)", level3["name"])
                level3_pathway_id = m.groups()[0].strip()
                level3_pathway_name = m.groups()[1].strip()
                if "children" in level3:
                    for ko in level3["children"]:
                        m = re.match(r"(\S+)\s+(\S+);\s+([^\[]+)\s*(\[EC:\S+(?:\s+[^\[\]]+)*\])*", ko["name"])
                        if m is not None:
                            ko_id = m.groups()[0].strip()
                            ko_name = m.groups()[1].strip()
                            ko_des = m.groups()[2].strip()
                            ec = m.groups()[3]
                            if ec is None:
                                ec = "-"
                            # write the row only when the KO name matched; otherwise ko_id would be
                            # undefined (or stale values from the previous entry would be repeated)
                            line = level1_pathway_id + "\t" + level1_pathway_name + "\t" + level2_pathway_id + "\t" + level2_pathway_name
                            line += "\t" + level3_pathway_id + "\t" + level3_pathway_name + "\t" + ko_id + "\t" + ko_name + "\t" + ko_des + "\t" + ec + "\n"
                            oh.write(line)

This produces the file KEGG_pathway_ko.txt; next, de-duplicate its rows.

import pandas as pd

data = pd.read_csv("KEGG_pathway_ko.txt", sep="\t",dtype=str)

data = data.drop_duplicates()

data.to_csv("KEGG_pathway_ko_uniq.txt", index=False, sep="\t")

The final output is KEGG_pathway_ko_uniq.txt, which contains the correspondence between KOs and KEGG pathways together with the pathway level classification (KEGG pathways have three levels), for example:

level1_pathway_id   level1_pathway_name level2_pathway_id   level2_pathway_name level3_pathway_id   level3_pathway_name ko  ko_name ko_des  ec
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K00844  HK  hexokinase  [EC:2.7.1.1]
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K12407  GCK glucokinase [EC:2.7.1.2]
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K00845  glk glucokinase [EC:2.7.1.2]

Merging the results
Since everything is now in plain tables, it is easy to join the different mappings above for downstream analysis, for example summarizing the KEGG annotation by pathway category or level, or running KEGG pathway enrichment on a particular gene set (a small merge sketch follows).
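A minimal R sketch of such a merge, assuming a gene-to-KO table with columns GID and Ko (like the barley_gene2ko.csv sketched above) and the KEGG_pathway_ko_uniq.txt produced by the Python script (its column names are the ones written in the header there):

library(dplyr)
pathway_tab <- read.delim("KEGG_pathway_ko_uniq.txt", stringsAsFactors = FALSE)
gene2ko <- read.csv("barley_gene2ko.csv", header = TRUE)
gene2pathway <- left_join(gene2ko, pathway_tab, by = c("Ko" = "ko"))   # attach pathway levels to every gene
table(gene2pathway$level2_pathway_name)                                # gene counts per level-2 pathway category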

Part 4: GO and KEGG enrichment with clusterProfiler

Install the required packages

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("clusterProfiler")

###Annotation obtained from the GO and KEGG files I built myself from the barley reference genome proteins
library(clusterProfiler)   # provides enricher(), go2term() and go2ont()
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2go <- read.csv("barley_gene2go.csv",header = T)

dim(gene2go)
head(gene2go)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
gene_list <- read.table("VRS4_GO_KEGG_inputdata.txt",header=T)

freq_d2 <- gene2go[which(gene2go$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2go[,c(3,2)]
head(term2gene)
df<-enricher(gene=gene_list2,
             pvalueCutoff = 1,
             pAdjustMethod = "BH",
             TERM2GENE = term2gene)
### pvalueCutoff = 1 here to keep all terms; 0.05 is the normal cutoff



df<-as.data.frame(df)
df
df1<-go2term(df$ID)
dim(df1)
head(df1)
colnames(df1) <- c("ID","Term")
df3 <- merge(df1,df,by = 'ID')
dim(df3)


df2<-go2ont(df$ID)
dim(df2)
head(df2)
df3$Ont<-df2$Ontology

head(df3)
write.csv(df3,"VRS4_GO_result.csv")
df4<-df3[,c(2,11,6)]   ### keep the Term, Ont and pvalue columns; the Count column could also be kept

head(df4)
library(ggplot2)
ggplot(df4,aes(x=Term,y=-log10(pvalue)))+
  geom_col(aes(fill=Ont))+
  coord_flip()+labs(x="")+
  theme_bw()














### KEGG analysis
#rm(list=ls())
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2ko <- read.csv("barley_gene2ko.csv",header = T)
head(gene2ko)
dim(gene2ko)
#choose.files()
ko_information <-read.csv("E:\\mywork\\大麦转录组\\KEGG_result\\KEGG.information.csv",header=T)
head(ko_information)
ko_information2 <- ko_information[,c(2,7)]
head(ko_information2)
ko_information3 <- ko_information[,c(7,8)]

freq_d2 <- gene2ko[which(gene2ko$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2ko[,c(3,2)]
head(term2gene)
freq_d3<-left_join(term2gene, ko_information2, by=c('Ko'='ID'))
head(freq_d3)
df<-enricher(gene=gene_list2,
             pvalueCutoff = 1,
             TERM2GENE =freq_d3[c('lev2_lev3_id', 'GID')],
             TERM2NAME=ko_information3)
## pvalueCutoff = 0.05 is the normal setting
head(df)
result <- as.data.frame(df@result)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
write.csv(result,"VRS4_KEGG_result.csv")
head(result)
head(df)
barplot(df)
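Besides barplot(), the same enrichment object can also be drawn as a dot plot (a minimal sketch; dotplot() is provided through the enrichplot package that clusterProfiler uses for its plotting):

dotplot(df, showCategory = 20)   # top 20 enriched categories, sized by gene count and coloured by adjusted p value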

References:
https://www.jianshu.com/p/49ceeae7fd95
https://www.jianshu.com/p/3a7149c1aee9
https://www.jianshu.com/p/4c2f2f578a43
https://www.jianshu.com/p/04cca6648439
https://www.bioinfo-scrounger.com/archives/512/
https://www.jianshu.com/p/9c9e97167377
https://www.jianshu.com/p/63c46de5f18b
https://www.jianshu.com/p/e646c0fa6443
https://www.jianshu.com/p/b4997c1165df
https://www.jianshu.com/p/bcdbf80701e2
https://zhuanlan.zhihu.com/p/43651419
https://blog.csdn.net/u012110870/article/details/102804318

https://www.jianshu.com/p/47b5ea646932
https://blog.csdn.net/weixin_43569478/article/details/83743975
https://blog.csdn.net/weixin_43569478/article/details/83744242
https://www.jianshu.com/p/22faee28c6a6

https://blog.csdn.net/weixin_41151172/article/details/80382567
https://zhuanlan.zhihu.com/p/78093534 (choosing a research direction based on GO/KEGG enrichment results)
https://www.bioinfo-scrounger.com/archives/639/ (visualizing KEGG pathways with the pathview package; worth a close read)
