非模式物种的GO和KEGG富集

对不能直接使用clusterprofiler包中的数据库进行KEGG和GO富集的物种进行分析。

1.从基因组fasta文件中提取CDS序列

基于有参转录组测序，从NCBI中下载物种的参考基因组fasta文件和基因组注释gtf文件，使用gffread从参考基因组fasta文件中提取CDS序列：

gffread Genomic.gtf -g Phytolacca_americana.fna -x CDS.fna

2.将CDS核酸序列转换为蛋白质序列

使用seqkit工具将CDS核酸序列转换为蛋白质序列：

seqkit translate CDS.fna >CDS_protein.fna

3.将蛋白质序列进行比对

使用emapper.py工具将蛋白质序列与数据库进行比对，对蛋白质序列进行注释：

nohup emapper.py -i CDS_protein.fna -o output --cpu 40 --dbmem -m diamond &

4.整理蛋白质比对结果

使用emapper.py比对后得到的蛋白质序列注释结果output.emapper.annotations结构如下：

image.png

删除前四行，并将第五行第一列中的#删除。
这里的query id是转录本id，利用GTF文件中转录本id与gene id的对应关系，将其转换为gene id ，并只保留gene_id、GOs和KEGG_ko列，保存文件为sub.emapper.annotations。

image.png

5.生成gene2go与gene2ko文件

gene2go文件即上述文件中的gene_id与GOs的对应关系文件，gene2ko文件即上述文件中的gene_id与KEGG_ko的对应关系文件，使用python进行整理：

import pandas as pd
data=pd.read_csv("./sub.emapper.annotations",sep="\t")
gene1=[]
gene2=[]
ko=[]
go=[]
evi=[]
for index,row in data.iterrows():
    if row[2]!="-":
        for i in row[2].split(","):
            gene1.append(row[0])
            ko.append(i)
    if row[1]!="-":
        for i in row[1].split(","):
            gene2.append(row[0])
            go.append(i)
            evi.append("IEA")
df=pd.DataFrame({"GID":gene1,"KO":ko})
print(df.head())
df.to_csv("gene2ko",sep="\t",index=False)
df=pd.DataFrame({"GID":gene2,"GO":go,"EVIDENCE":evi})
print(df.head())
df.to_csv("gene2go",sep="\t",index=False)

6.GO富集

主要是使用enricher函数，需要加载clusterProfiler包：

go2name <- read.delim('GO.library', header = FALSE, stringsAsFactors = FALSE)
names(go2name) <- c('ID', 'Description', 'Ontology')
gene2go<-read.table("gene2go",sep="\t",quote="",header=1)
colnames(gene2go)<-c("gene","term","EVIDENCE")
gene2go<-distinct(gene2go)
gene2go<-gene2go[gene2go$term%in%go2name$ID,]
go_rich <- enricher(
          gene  = gene$V1,
          TERM2GENE = gene2go[c('term','gene')],
          TERM2NAME = go2name[c('ID', 'Description')],
          pAdjustMethod = "BH",
          pvalueCutoff  = 1,
          qvalueCutoff  = 1)

7.KEGG富集

需要构建ko与name的关系列表：

pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(KEGG_info)
for (a in seq_along(kegg[["children"]][["children"]])) {
  A <- kegg[["children"]][["name"]][[a]]
  for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
    B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
    for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
      pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
      pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
      pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
      pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
      kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
      kos <- str_match(kos_info, "K[0-9]*")[,1]
      ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
    }
  }
}

使用左连接保存为ko2name文件，其结构如下：

image.png

进行KEGG富集：

#kegg
ko2name<-read.table("/mnt/home/huanggy/rna_scripts/5_rich/KEGG_self/ko2name",header=1,sep="\t",quote="")
colnames(ko2name)<-c("ID","Pathway","Description")
#ko2name$ID<-paste0("ko:",ko2name$ID)
gene2kegg<-read.table("/mnt/home/huanggy/rna_scripts/5_rich/KEGG_self/Phy_a/gene2ko",sep="\t",quote="",header=1)
colnames(gene2kegg)<-c("gene","term")
gene2kegg$Pathway<-ko2name[match(gene2kegg$term,paste0("ko:",ko2name$ID)),"Pathway"]
gene2kegg<-na.omit(gene2kegg)
ko_rich <- enricher(
          gene  = gene$V1,
          TERM2GENE = gene2kegg[c('Pathway','gene')],
          TERM2NAME = ko2name[c('Pathway', 'Description')],
          pAdjustMethod = "BH",
          pvalueCutoff  = 1,
          qvalueCutoff  = 1)

非模式物种的GO和KEGG富集