对不能直接使用clusterprofiler包中的数据库进行KEGG和GO富集的物种进行分析。
1.从基因组fasta文件中提取CDS序列
基于有参转录组测序,从NCBI中下载物种的参考基因组fasta文件和基因组注释gtf文件,使用gffread从参考基因组fasta文件中提取CDS序列:
gffread Genomic.gtf -g Phytolacca_americana.fna -x CDS.fna
2.将CDS核酸序列转换为蛋白质序列
使用seqkit工具将CDS核酸序列转换为蛋白质序列:
seqkit translate CDS.fna >CDS_protein.fna
3.将蛋白质序列进行比对
使用emapper.py工具将蛋白质序列与数据库进行比对,对蛋白质序列进行注释:
nohup emapper.py -i CDS_protein.fna -o output --cpu 40 --dbmem -m diamond &
4.整理蛋白质比对结果
使用emapper.py比对后得到的蛋白质序列注释结果output.emapper.annotations结构如下:

image.png
删除前四行,并将第五行第一列中的#删除。
这里的query id是转录本id,利用GTF文件中转录本id与gene id的对应关系,将其转换为gene id ,并只保留gene_id、GOs和KEGG_ko列,保存文件为sub.emapper.annotations。

image.png
5.生成gene2go与gene2ko文件
gene2go文件即上述文件中的gene_id与GOs的对应关系文件,gene2ko文件即上述文件中的gene_id与KEGG_ko的对应关系文件,使用python进行整理:
import pandas as pd
data=pd.read_csv("./sub.emapper.annotations",sep="\t")
gene1=[]
gene2=[]
ko=[]
go=[]
evi=[]
for index,row in data.iterrows():
if row[2]!="-":
for i in row[2].split(","):
gene1.append(row[0])
ko.append(i)
if row[1]!="-":
for i in row[1].split(","):
gene2.append(row[0])
go.append(i)
evi.append("IEA")
df=pd.DataFrame({"GID":gene1,"KO":ko})
print(df.head())
df.to_csv("gene2ko",sep="\t",index=False)
df=pd.DataFrame({"GID":gene2,"GO":go,"EVIDENCE":evi})
print(df.head())
df.to_csv("gene2go",sep="\t",index=False)
6.GO富集
主要是使用enricher函数,需要加载clusterProfiler包:
go2name <- read.delim('GO.library', header = FALSE, stringsAsFactors = FALSE)
names(go2name) <- c('ID', 'Description', 'Ontology')
gene2go<-read.table("gene2go",sep="\t",quote="",header=1)
colnames(gene2go)<-c("gene","term","EVIDENCE")
gene2go<-distinct(gene2go)
gene2go<-gene2go[gene2go$term%in%go2name$ID,]
go_rich <- enricher(
gene = gene$V1,
TERM2GENE = gene2go[c('term','gene')],
TERM2NAME = go2name[c('ID', 'Description')],
pAdjustMethod = "BH",
pvalueCutoff = 1,
qvalueCutoff = 1)
7.KEGG富集
需要构建ko与name的关系列表:
pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(KEGG_info)
for (a in seq_along(kegg[["children"]][["children"]])) {
A <- kegg[["children"]][["name"]][[a]]
for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
B <- kegg[["children"]][["children"]][[a]][["name"]][[b]]
for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
kos <- str_match(kos_info, "K[0-9]*")[,1]
ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
}
}
}
使用左连接保存为ko2name文件,其结构如下:

image.png
进行KEGG富集:
#kegg
ko2name<-read.table("/mnt/home/huanggy/rna_scripts/5_rich/KEGG_self/ko2name",header=1,sep="\t",quote="")
colnames(ko2name)<-c("ID","Pathway","Description")
#ko2name$ID<-paste0("ko:",ko2name$ID)
gene2kegg<-read.table("/mnt/home/huanggy/rna_scripts/5_rich/KEGG_self/Phy_a/gene2ko",sep="\t",quote="",header=1)
colnames(gene2kegg)<-c("gene","term")
gene2kegg$Pathway<-ko2name[match(gene2kegg$term,paste0("ko:",ko2name$ID)),"Pathway"]
gene2kegg<-na.omit(gene2kegg)
ko_rich <- enricher(
gene = gene$V1,
TERM2GENE = gene2kegg[c('Pathway','gene')],
TERM2NAME = ko2name[c('Pathway', 'Description')],
pAdjustMethod = "BH",
pvalueCutoff = 1,
qvalueCutoff = 1)