ChIP-seq
、ATAC-seq
等获取到基因组范围内的peak后,需要对其进行基因注释以便进行后续的分析。ChIPseeker
做为peak注释的R包,用来很不错,但有时候其结果还是满足不了个性化的需要。比如,只想关注某些特定类型的基因,如protein_coding
、lincRNA
、snRNA
,这个时候就得自己想办法添加基因类型了,因为ChIPseeker
注释结果没有这些信息。
正好本人遇到过这类的需求,那就分享一下自己的解决办法,让需要的朋友可以不必浪费时间便可轻松跨过这个小坑。下面用macs2
的peak结果来说明:
chr1 3472147 3472924 sample1_peak_1 144 . 7.43822 16.84734 14.45499 236
chr1 3670777 3672287 sample1_peak_2 143 . 6.49641 16.75010 14.36088 485
chr1 3872843 3874430 sample1_peak_3 631 . 9.24537 66.46637 63.18048 630
chr1 4234288 4235914 sample1_peak_4 98 . 5.48185 12.13378 9.86325 1209
chr1 4496702 4497572 sample1_peak_5 87 . 5.65304 10.96738 8.72978 689
chr1 4590520 4591742 sample1_peak_6 242 . 6.89093 26.86273 24.25464 996
chr1 4608160 4611414 sample1_peak_7 275 . 3.18758 30.21178 27.53851 2034
chr1 4629533 4629848 sample1_peak_8 53 . 4.46293 7.47678 5.35324 81
chr1 4664981 4665332 sample1_peak_9 38 . 3.86787 5.89028 3.82913 183
chr1 4722306 4724297 sample1_peak_10 403 . 4.77592 43.30846 40.39729 1322
ChIPseeker
默认情况下注释的结果:
seqnames start end width strand V4 V5 V6 V7 V8
1 chr1 3472148 3472924 777 * sample1_peak_1 144 . 7.43822 16.84734
2 chr1 3670778 3672287 1510 * sample1_peak_2 143 . 6.49641 16.75010
3 chr1 3872844 3874430 1587 * sample1_peak_3 631 . 9.24537 66.46637
4 chr1 4234289 4235914 1626 * sample1_peak_4 98 . 5.48185 12.13378
5 chr1 4496703 4497572 870 * sample1_peak_5 87 . 5.65304 10.96738
6 chr1 4590521 4591742 1222 * sample1_peak_6 242 . 6.89093 26.86273
V9 V10
1 14.45499 236
2 14.36088 485
3 63.18048 630
4 9.86325 1209
5 8.72978 689
6 24.25464 996
annotation geneChr
1 Intron (ENSMUST00000161581.1/ENSMUSG00000089699.1, intron 1 of 1) 1
2 Promoter (<=1kb) 1
3 Distal Intergenic 1
4 Intron (ENSMUST00000208660.1/ENSMUSG00000025900.13, intron 12 of 29) 1
5 Promoter (<=1kb) 1
6 Distal Intergenic 1
geneStart geneEnd geneLength geneStrand geneId
1 3464977 3467285 2309 2 ENSMUSG00000103025.1
2 3214482 3671498 457017 2 ENSMUSG00000051951.5
3 3783876 3783933 58 2 ENSMUSG00000088333.2
4 4256234 4260519 4286 2 ENSMUSG00000102948.1
5 4491250 4496757 5508 2 ENSMUSG00000025902.13
6 4583129 4586252 3124 2 ENSMUSG00000104328.1
transcriptId distanceToTSS
1 ENSMUST00000194099.1 -4863
2 ENSMUST00000070533.4 0
3 ENSMUST00000157708.2 -88911
4 ENSMUST00000195384.1 24605
5 ENSMUST00000195555.1 0
6 ENSMUST00000195771.1 -4269
在原有的基础追加了gene_name
、gene_type
两列信息:
seqnames start end width name score stand signal pvalue
1 chr1 3472148 3472924 777 sample1_peak_1 144 . 7.43822 16.84734
2 chr1 3670778 3672287 1510 sample1_peak_2 143 . 6.49641 16.75010
3 chr1 3872844 3874430 1587 sample1_peak_3 631 . 9.24537 66.46637
4 chr1 4234289 4235914 1626 sample1_peak_4 98 . 5.48185 12.13378
5 chr1 4496703 4497572 870 sample1_peak_5 87 . 5.65304 10.96738
6 chr1 4590521 4591742 1222 sample1_peak_6 242 . 6.89093 26.86273
qvalue summit annotation geneChr geneStart geneEnd geneLength
1 14.45499 236 Intron chr1 3464977 3467285 2309
2 14.36088 485 Promoter (<=1kb) chr1 3214482 3671498 457017
3 63.18048 630 Distal Intergenic chr1 3783876 3783933 58
4 9.86325 1209 Intron chr1 4256234 4260519 4286
5 8.72978 689 Promoter (<=1kb) chr1 4491250 4496757 5508
6 24.25464 996 Distal Intergenic chr1 4583129 4586252 3124
geneStrand geneId transcriptId distanceToTSS gene_name
1 2 ENSMUSG00000103025.1 ENSMUST00000194099.1 -4863 Gm37686
2 2 ENSMUSG00000051951.5 ENSMUST00000070533.4 0 Xkr4
3 2 ENSMUSG00000088333.2 ENSMUST00000157708.2 -88911 Gm27396
4 2 ENSMUSG00000102948.1 ENSMUST00000195384.1 24605 Gm6101
5 2 ENSMUSG00000025902.13 ENSMUST00000195555.1 0 Sox17
6 2 ENSMUSG00000104328.1 ENSMUST00000195771.1 -4269 Gm37323
gene_type
1 TEC
2 protein_coding
3 snRNA
4 processed_pseudogene
5 protein_coding
6 lincRNA
为了方便以后遇到类似需求,写了一个R脚本,可以用来做这个事,分析时只需在命令直接调用即可,方便快捷:
#!/usr/bin/Rscript
suppressPackageStartupMessages(library(optparse))
option_list = list(make_option(c("-i","--input"), help = "peak file, multiple are separated by commas."),
make_option(c("-e","--species"), help = "optional value is human or mouse, [default: %default]."),
make_option(c("-g","--gtf"), help = "genome gtf, [default: %default]."),
make_option(c("-l","--label"), help = "sample label, multiple are separated by commas, [default: %default]."),
make_option(c("-a","--atype"), help = "annotation type for analysis, multiple are separated by commas, [default: %default]."),
make_option(c("-f","--ftype"), help = "gene type for analysis, multiple are separated by commas, [default: %default]."),
make_option(c("-u","--up"), default = 3000, help = "length of tss upstream, [default: %default]."),
make_option(c("-d","--dn"), default = 3000, help = "length of tss downstream, [default: %default]."),
make_option(c("-c","--cname"), default='name,score,stand,signal,pvalue,qvalue,summit', help = "colnames from 3td column to end in peak file, [default: '%default']."),
make_option(c("-s","--save"), default = FALSE, action="store_true", help = "whether to save annotation txt, [default: %default]."),
make_option(c("-p","--prefix"), default = 'chipseeker', help = "the prefix for output, [default: '%default']."),
make_option(c("-o","--outdir"), default = '.', help = "output directory, [default: '%default']."))
opt_parser <- OptionParser(option_list = option_list,
description='\n\tThis script uesed to annotate peak using gtf by ChIPseeker, default format is for macs2 result.
\ntxdb: TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Mmusculus.UCSC.mm10.knownGene.
\nannotation type: Promoter, Exon, Intron, UTR, Downstream, Intergenic.
\ngene type: feature in gtf, protein_coding, snRNA, et al.')
args <- parse_args(opt_parser)
if(is.null(args$input) & (is.null(args$species) | is.null(args$gtf))){
print_help(opt_parser)
stop("argument input and species or gtf are required.")
}
suppressPackageStartupMessages(library(ChIPseeker))
suppressPackageStartupMessages(library(ggplot2))
if(!dir.exists(args$outdir)){
dir.create(args$outdir)
}
plotpie <- function(data){
if(class(data)=='csAnno'){
pdata <- data@annoStat
}else{
pdata <- data
}
pdata$ratio <- round(pdata$Frequency, 2)
pdata$label <- paste0(pdata$Feature, ' (', pdata$ratio,'%)')
col <- ChIPseeker:::getCols(nrow(pdata))
p <- ggplot(pdata,aes(x=1,ratio,fill=factor(label, levels=label))) +
geom_bar(stat='identity', color="black", show.legend=T) +
labs(fill="") +
coord_polar("y", start=0) +
scale_y_reverse() +
theme_void() +
scale_fill_manual(values=col) +
theme(legend.spacing.y=unit(0.1,'cm'), legend.box.margin=margin(1,10,1,-12), legend.key.size=unit(0.15,'inches'), legend.text=element_text(size=8)) +
guides(fill = guide_legend(byrow = T))
return(p)
}
anno <- function(input,species,gtf,label,atype,ftype,up,dn,cname,save,prefix,outdir){
if(!is.null(gtf) & !is.null(species)){
print('both arguments gtf and species are provided, will use gtf to annotate.')
}
if(!is.null(gtf)){
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf)
egdb <- NULL
gtftab <- as.data.frame(rtracklayer::import(gtf))
gtftab <- unique(gtftab[gtftab$type == 'transcript', c('transcript_id', 'gene_name', 'gene_type')])
}else if(!is.null(species)){
if(species == 'human'){
suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg38.knownGene))
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
egdb <- 'org.Hs.eg.db'
}else if(species == 'mouse'){
suppressPackageStartupMessages(library(TxDb.Mmusculus.UCSC.mm10.knownGene))
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
egdb <- 'org.Mm.eg.db'
}else{
stop("argument species is human or mouse.")
}
}else{
stop("please providing argument species or gtf.")
}
files <- strsplit(input, split='\\,')[[1]]
if(!is.null(label)){
labs <- strsplit(label, split='\\,')[[1]]
}else{
labs <- sapply(basename(files), tools::file_path_sans_ext)
}
alist <- list()
for(idx in 1:length(files)){
peak <- readPeakFile(files[idx])
peakanno <- annotatePeak(peak, tssRegion=c(-up, dn), TxDb=txdb, annoDb=egdb)
if(!is.null(atype)){
atype <- strsplit(atype, split='\\,')[[1]]
peakanno <- subset(peakanno, annotation %in% grep(paste0(atype, collapse='|'), annotation, value=T))
}
if(!is.null(ftype) & !is.null(gtf)){
types <- strsplit(ftype, split='\\,')[[1]]
genes <- gtftab[gtftab$gene_type %in% types, 'transcript_id']
peakanno <- subset(peakanno, transcriptId %in% genes)
}else if(!is.null(ftype) & is.null(gtf)){
print("warning: argument ftype takes effect only when gtf is provided.")
}
if(save){
out <- as.data.frame(peakanno)
ord <- out[, 6]
if(length(grep('^V',colnames(out))) != 0){
out <- base::subset(out, select=-strand)
colnames(out)[grep('^V',colnames(out))] <- strsplit(cname, split='\\,')[[1]]
}
out$geneChr <- out$seqnames
out$annotation <- sapply(out$annotation,function(x){ifelse(grepl('^Intron|^Exon', x), strsplit(x,split=' ')[[1]][1], x)})
if(!is.null(gtf)){
out$transcript_id <- out$transcriptId
out <- merge(out, gtftab, all.x=T)
out <- out[-1]
out <- out[match(ord, out[,5]),]
}else{
out$geneId <- ifelse(is.na(out$ENSEMBL),out$geneId,out$ENSEMBL)
out <- base::subset(out, select=-c(ENSEMBL, GENENAME))
}
write.table(out,paste0(outdir,'/',prefix,'.',labs[idx],'.anno.txt'),sep='\t',quote=F,row.names=F)
}
alist[[labs[idx]]] <- peakanno
}
if(length(alist)>1){
p <- plotAnnoBar(alist)
ggsave(paste0(outdir,'/',prefix,'.anno.pdf'), p, width=6, height = 2.5 + length(alist)/2)
}else{
#plotAnnoPie(alist[[1]])
p <- plotpie(alist[[1]])
ggsave(paste0(outdir,'/',prefix,'.',labs[1],'.anno.pdf'), p, width=5, height=3)
}
}
anno(args$input,args$species,args$gtf,args$label,args$atype,args$ftype,args$up,args$dn,args$cname,args$save,args$prefix,args$outdir)
为了保证脚本可以正常使用,需要保证环境里面已安装optparse
、ChIPseeker
、ggplot2
、rtracklayer
、GenomicFeatures
。如果不提供gtf
,还需要TxDb.Hsapiens.UCSC.hg38.knownGene
、TxDb.Mmusculus.UCSC.mm10.knownGene
、org.Hs.eg.db
、org.Mm.eg.db
数据库包。
由于经常遇到人和小鼠,所以这里--species
参数只设置了这两个物种,其他物种为方便分析可以选择使用gtf
模式。脚本的使用说明和参数如下:
Rscript get_annopeak_chipseeker.r --help
Usage: get_annopeak_chipseeker.r [options]
This script uesed to annotate peak using gtf by ChIPseeker, default format is for macs2 result.
txdb: TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Mmusculus.UCSC.mm10.knownGene.
annotation type: Promoter, Exon, Intron, UTR, Downstream, Intergenic.
gene type: feature in gtf, protein_coding, snRNA, et al.
Options:
-i INPUT, --input=INPUT
peak file, multiple are separated by commas.
-e SPECIES, --species=SPECIES
optional value is human or mouse, [default: NULL].
-g GTF, --gtf=GTF
genome gtf, [default: NULL].
-l LABEL, --label=LABEL
sample label, multiple are separated by commas, [default: NULL].
-a ATYPE, --atype=ATYPE
annotation type for analysis, multiple are separated by commas, [default: NULL].
-f FTYPE, --ftype=FTYPE
gene type for analysis, multiple are separated by commas, [default: NULL].
-u UP, --up=UP
length of tss upstream, [default: 3000].
-d DN, --dn=DN
length of tss downstream, [default: 3000].
-c CNAME, --cname=CNAME
colnames from 3td column to end in peak file, [default: 'name,score,stand,signal,pvalue,qvalue,summit'].
-s, --save
whether to save annotation txt, [default: FALSE].
-p PREFIX, --prefix=PREFIX
the prefix for output, [default: 'chipseeker'].
-o OUTDIR, --outdir=OUTDIR
output directory, [default: '.'].
-h, --help
Show this help message and exit
大部分参数已经通过蹩脚的英语做了解释,看上去应该可以理解其代表的含义,这里说一下使用时应该注意的地方:--species
(ChIPseeker默认注释结果)和--gtf
两个参数二选一即可,如果需要基因类型注释必须选择--gtf
;--label
参数可以给多样本提供样本名,顺序需要跟--input
保持一致;默认只保存图片,要保留注释文件的话记得使用--save
参数;单个样本运行时注释结果以饼图展示,多个样本一起运行时用条形图展示。
另外,饼图并没有直接ChIPseeker
里面的画图函数,想要了解详情的可以戳这里[ChIPseeker绘图函数借用]。下面来演示一下脚本的使用:
Rscript get_annopeak_chipseeker.r -i sample1_peaks.narrowPeak -g gencode.vM25.annotation.gtf -l sample1 -s
上面命令执行结束会在当前目录生成两个文件chipseeker.sample1.anno.txt
、chipseeker.sample1.anno.pdf
,分别为注释结果和饼图。
多个样本一起运行时,并选择自己想关注的基因类型和功能区间,生成条形图方便比较多个样本的结果:
Rscript get_annopeak_chipseeker.r -i sample1_peaks.narrowPeak,sample2_peaks.narrowPeak -g gencode.vM25.annotation.gtf -l sample1,sample2 -s -p chipseeker.type -a Promoter,Exon,Intron -f protein_coding,lincRNA,snRNA
上面的命令执行结束会给每个样本生成一个.anno.txt
注释结果和一个汇总所有样本的条形图。
往期回顾
linux | while + read 实现本地 or 集群批处理,实用且优雅
pyscenic | 单细胞转录因子分析,原理图文详解
一网打尽scRNA矩阵格式读取和转化(h5 h5ad loom)
ggplot2 | 开发自己的画图函数
R包安装的4种姿势