2019-11-14学习使用ChIPseeker包

参考学习资料:ChIPseeker: an R package for ChIP peak Annotation, Comparison and Visualization
这个R包原来是作者设计用来注释ChIP seq及可视化的一个工具包,但是后面经过多次更新完善了很多新的功能:

这个包可以自己新建感兴趣的TxDb 对象,通过从 UCSC和BioMart数据库提取信息然后通过 R函数TxDbFromBiomartmakeTxDbFromUCSC. TxDb 对象再拿来做peak annotation.
在R里面输入?makeTxDbFromBiomart()即可打开帮助文档通过实例来练习:

以下是我从帮助文档找到的例子,换上自己感兴趣的基因的转录本进行练习。

## We can use listDatasets() from the biomaRt package to list the
## datasets available in the "ENSEMBL_MART_ENSEMBL" BioMart database:
library(biomaRt)
listMarts(host="www.ensembl.org")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
datasets <- listDatasets(mart)
head(datasets)
subset(datasets, grepl("elegans", dataset, ignore.case=TRUE))

## Retrieve the full transcript dataset for Worm:
txdb1 <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
txdb1

## Retrieve an incomplete transcript dataset for Human:
transcript_ids <- c(
  "ENST00000313806",
  "ENST00000443408",
  "ENST00000399272",
  "ENST00000381132",
  "ENST00000381135",
  "ENST00000620920",
  "ENST00000481448",
  "ENST00000492600"
)

if (interactive()) {
  txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids)
  txdb2  # note that these annotations match the GRCh38 genome assembly
}
save(txdb2, file = "RCAN1-ALLisoform.Rdata")
## ---------------------------------------------------------------------
## B. ACCESSING THE EnsemblGenomes MARTS
## ---------------------------------------------------------------------

library(biomaRt)

## Note that BioMart is not currently available for Ensembl Bacteria.

## ---------------------
## --- Ensembl Fungi ---

mart <- useMart(biomart="fungi_mart", host="fungi.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
yeast_txdb <- makeTxDbFromBiomart(biomart="fungi_mart",
                                  dataset="scerevisiae_eg_gene",
                                  host="fungi.ensembl.org")
yeast_txdb

## Note that the dataset for Yeast on Ensembl Fungi is not necessarily
## the same as on the main Ensembl database:
yeast_txdb0 <- makeTxDbFromBiomart(dataset="scerevisiae_gene_ensembl")
all(transcripts(yeast_txdb0) %in% transcripts(yeast_txdb))
all(transcripts(yeast_txdb) %in% transcripts(yeast_txdb0))

## -----------------------
## --- Ensembl Metazoa ---

## The metazoa mart is slow and at the same time it doesn't seem to
## support requests that take more than 1 min at the moment. So a call to
## biomaRt::getBM() will fail with a "Timeout was reached" error if the
## requested data takes more than 1 min to download. This unfortunately
## happens with the example below so we don't try to run it for now.

## Not run: 
mart <- useMart(biomart="metazoa_mart", host="metazoa.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
worm_txdb <- makeTxDbFromBiomart(biomart="metazoa_mart",
                                 dataset="celegans_eg_gene",
                                 host="metazoa.ensembl.org")
worm_txdb

## Note that even if the dataset for Worm on Ensembl Metazoa contains
## the same transcript as on the main Ensembl database, the transcript
## type might be annotated with slightly different terms (e.g. antisense
## vs antisense_RNA):
filter <- list(tx_name="Y71G12B.44")
transcripts(worm_txdb, filter=filter, columns=c("tx_name", "tx_type"))
transcripts(txdb1, filter=filter, columns=c("tx_name", "tx_type"))

## End(Not run)
## ----------------------
## --- Ensembl Plants ---

## Like the metazoa mart (see above), the plants mart is also slow and
## doesn't seem to support requests that take more than 1 min either.
## So we don't try to run the example below for now.

## Not run: 
mart <- useMart(biomart="plants_mart", host="plants.ensembl.org")
datasets <- listDatasets(mart)
datasets[ , 1:2]
athaliana_txdb <- makeTxDbFromBiomart(biomart="plants_mart",
                                      dataset="athaliana_eg_gene",
                                      host="plants.ensembl.org")
athaliana_txdb

## End(Not run)
## ------------------------
## --- Ensembl Protists ---

mart <- useMart(biomart="protists_mart", host="protists.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
tgondii_txdb <- makeTxDbFromBiomart(biomart="protists_mart",
                                    dataset="tgondii_eg_gene",
                                    host="protists.ensembl.org")
tgondii_txdb

## ---------------------------------------------------------------------
## C. USING AN Ensembl MIRROR
## ---------------------------------------------------------------------

## You can use the 'host' argument to access the "ENSEMBL_MART_ENSEMBL"
## BioMart database at a mirror (e.g. at uswest.ensembl.org). A gotcha
## when doing this is that the name of the database on the mirror might
## be different! We can check this with listMarts() from the biomaRt
## package:
listMarts(host="useast.ensembl.org")

## Therefore in addition to setting 'host' to "uswest.ensembl.org" we
## might also need to specify the 'biomart' argument:
if (interactive()) {
  txdb3 <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                               dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids,
                               host="useast.ensembl.org")
  txdb3
}

## ---------------------------------------------------------------------
## D. USING FILTERS
## ---------------------------------------------------------------------

## We can use listFilters() from the biomaRt package to get valid filter
## names:
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl",
                host="www.ensembl.org")
head(listFilters(mart))

## Retrieve transcript dataset for Ensembl gene ENSG00000011198:
my_filter <- list(ensembl_gene_id="ENSG00000011198")

if (interactive()) {
  txdb4 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               filter=my_filter)
  txdb4
  transcripts(txdb4, columns=c("tx_id", "tx_name", "gene_id"))
  transcriptLengths(txdb4)
}

## ---------------------------------------------------------------------
## E. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------

chrominfo <- getChromInfoFromBiomart(dataset="celegans_gene_ensembl")
chrominfo

实例主要是通过线虫来演示,可能是因为这个生物的基因组小,数据量少的缘故。总的来讲还是不错的,后面还需要勤加练习。
这个包的可视化做的真的很棒例如用以下代码实现叠加韦恩图的效果:

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
files <- getSampleFiles()
print(files)

peak <- readPeakFile(files[[4]])
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")

plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
#BiocManager::install("ggupset")
#BiocManager::install("ggplot2")
#BiocManager::install("ggimage")
library(ggimage)
library(ggupset)
library(ggplot2)
upsetplot(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)
叠加韦恩图的效果

关于富集分析ChIPseeker 还提供一个函数, seq2gene, 可以完成连接基因区域到基因多对多的映射. 作者考虑到host gene (exon/intron), promoter region and flanking gene from intergenic region 这些区域可能通过cis-regulation来调控。 这个函数的设计初衷应该是联系基因组上编码和非编码基因之间的功能预测。

因为生物研究领域富集分析运用广泛,这个包的作者Y叔也开发了一些Bioconductor packages来进行分析这浩如烟海的数据,例如可以探究是否筛选的基因与某个特殊的生物学过程是相关的分析工具:包括 DOSE(Yu et al. 2015) 做疾病富集分析 Disease Ontology, ReactomePA 做 reactome pathway, 还有鼎鼎大名的clusterProfiler(Yu et al. 2012) 做的 Gene Ontology 和 KEGG 富集分析.

下一个重点就是注释了
需要先下载ReactomePA
这个包有个.db文件有点大,下载之前必须要配置好镜像。否则可能导致失败。
下载后我查看了一下结果如下

Last login: Thu Nov 14 07:45:10 on console
(base) Cheng-MBP:~ chelsea$ cd /var/folders/tm/q03dw2_n18v2v6nlsw81yhc80000gn/T//RtmpLJTJI5/downloaded_packages
(base) Cheng-MBP:downloaded_packages chelsea$ ls
ChIPseeker_1.22.0.tgz     ggplot2_3.2.1.tgz         magick_2.2.tgz
ReactomePA_1.30.0.tgz     ggupset_0.1.0.tgz         reactome.db_1.70.0.tar.gz
ggimage_0.2.4.tgz         graphite_1.32.0.tgz
(base) Cheng-MBP:downloaded_packages chelsea$ ll
total 964864
-rw-r--r--  1 chelsea  staff    3671144 Nov 14 08:00 ChIPseeker_1.22.0.tgz
-rw-r--r--  1 chelsea  staff    1419743 Nov 14 11:40 ReactomePA_1.30.0.tgz
-rw-r--r--  1 chelsea  staff     476247 Nov 14 11:07 ggimage_0.2.4.tgz
-rw-r--r--  1 chelsea  staff    3973186 Nov 14 10:59 ggplot2_3.2.1.tgz
-rw-r--r--  1 chelsea  staff    1416035 Nov 14 10:57 ggupset_0.1.0.tgz
-rw-r--r--  1 chelsea  staff     844840 Nov 14 11:38 graphite_1.32.0.tgz
-rw-r--r--  1 chelsea  staff   27211445 Nov 14 11:07 magick_2.2.tgz
-rw-r--r--  1 chelsea  staff  454979427 Nov 14 11:42 reactome.db_1.70.0.tar.gz

总共有400多兆,配置镜像相关可以参考https://mp.weixin.qq.com/s/nf2bzkBAkWp4W-KB5uBj2g

然后是注释结果可视化

#BiocManager::install("ReactomePA")
library(ReactomePA)

pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)

gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
dotplot(pathway2)
dot图可视化

然后是对peak进行比较分析

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))

比较结果

分面显示多个比较结果置信区间参数设置为conf=0.95

plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")
多个比较结果展示

还可用热图展示:

tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)
可能是画布的设置有点问题,没有能显示全面

除此之外还有对注释后的比较

有2个函数可以接受peaks注释后的列表信息(output of annotatePeak):plotAnnoBarplotDistToTSS .

peakAnnoList <- lapply(files, annotatePeak, TxDb=txdb,
                       tssRegion=c(-3000, 3000), verbose=FALSE)

接下来科研用plotAnnoBar函数来比较genomic annotation.

plotAnnoBar(peakAnnoList)

比较结果如下图所示:

`plotAnnoBar`函数比对结果

R 函数plotDistToTSS 可以以用来展示与TSS 的距离数据。也就是展示转录因子结合位点在TSS附近的分布情况。

plotDistToTSS(peakAnnoList)
plotDistToTSS函数可视化的结果

然后是功能分析:

genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster   = genes,
                           fun           = "enrichKEGG",
                           pvalueCutoff  = 0.05,
                           pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")
这个功能分析还做了比较

寻找peaks和注释的基因的overlap

genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)
peaks和注释的基因的overlap

作者认为寻找overlap非常重要

Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.

此外,找到重叠仅仅是流于表面的东西,更重要的是这个软件可以帮助计算显著性,这样得到的结果才可靠。

> p <- GRanges(seqnames=c("chr1", "chr3"),
+              ranges=IRanges(start=c(1, 100), end=c(50, 130)))
> shuffle(p, TxDb=txdb)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr1 227521923-227521972      *
  [2]     chr3     1960405-1960435      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> enrichPeakOverlap(queryPeak     = files[[5]],
+                   targetPeak    = unlist(files[1:4]),
+                   TxDb          = txdb,
+                   pAdjustMethod = "BH",
+                   nShuffle      = 50,
+                   chainFile     = NULL,
+                   verbose       = FALSE)
                                                      qSample                                            tSample qLen
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                    GSM1174480_ARmo_0M_peaks.bed.gz 1663
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                   GSM1174481_ARmo_1nM_peaks.bed.gz 1663
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663
           tLen N_OL     pvalue   p.adjust
ARmo_0M     812    0 0.88235294 0.88235294
ARmo_1nM   2296    8 0.17647059 0.35294118
ARmo_100nM 1359    3 0.33333333 0.44444444
CBX6_BF    1331  968 0.01960784 0.07843137

可以看到结果显示,CBX6_BF分析结果是有显著性的。
作者讲他们收集了目前GEO库中17,000 bed文件,可以通过getGEOspecies()函数得到他们的物种相关总结性的信息。

也可以通过以下函数下载bed文件相关信息

> downloadGEObedFiles(genome="hg19", destDir="hg19")
There were 50 or more warnings (use warnings() to see the first 50)
> gsm <- hg19$gsm[sample(nrow(hg19), 10)]
> downloadGSMbedFiles(gsm, destDir="hg19")
There were 36 warnings (use warnings() to see them)

至于那些警告是什么我也没有继续探究,提示什么非零,curl.rb的问题,我完全不知道在说啥。不过暂时也没时间去探究了。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,657评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,662评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,143评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,732评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,837评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,036评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,126评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,868评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,315评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,641评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,773评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,470评论 4 333
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,126评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,859评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,095评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,584评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,676评论 2 351