2019-11-14学习使用ChIPseeker包

参考学习资料：ChIPseeker: an R package for ChIP peak Annotation, Comparison and Visualization
这个R包原来是作者设计用来注释ChIP seq及可视化的一个工具包，但是后面经过多次更新完善了很多新的功能：

ChIP profiling
Peak Annotation
Functional enrichment analysis
ChIP peak data set comparison
Statistical testing of ChIP seq overlap
Data Mining with ChIP seq data deposited in GEO
其实我更感兴趣的是后面2个项目，用自己的数据去测试和这个软件分析的ChIP seq overlap，还有最后一个在GEO数据库中“挖矿”。

这个包可以自己新建感兴趣的TxDb 对象，通过从 UCSC和BioMart数据库提取信息然后通过 R函数TxDbFromBiomart 和makeTxDbFromUCSC. TxDb 对象再拿来做peak annotation.
在R里面输入?makeTxDbFromBiomart()即可打开帮助文档通过实例来练习：

以下是我从帮助文档找到的例子，换上自己感兴趣的基因的转录本进行练习。

## We can use listDatasets() from the biomaRt package to list the
## datasets available in the "ENSEMBL_MART_ENSEMBL" BioMart database:
library(biomaRt)
listMarts(host="www.ensembl.org")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
datasets <- listDatasets(mart)
head(datasets)
subset(datasets, grepl("elegans", dataset, ignore.case=TRUE))

## Retrieve the full transcript dataset for Worm:
txdb1 <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
txdb1

## Retrieve an incomplete transcript dataset for Human:
transcript_ids <- c(
  "ENST00000313806",
  "ENST00000443408",
  "ENST00000399272",
  "ENST00000381132",
  "ENST00000381135",
  "ENST00000620920",
  "ENST00000481448",
  "ENST00000492600"
)

if (interactive()) {
  txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids)
  txdb2  # note that these annotations match the GRCh38 genome assembly
}
save(txdb2, file = "RCAN1-ALLisoform.Rdata")
## ---------------------------------------------------------------------
## B. ACCESSING THE EnsemblGenomes MARTS
## ---------------------------------------------------------------------

library(biomaRt)

## Note that BioMart is not currently available for Ensembl Bacteria.

## ---------------------
## --- Ensembl Fungi ---

mart <- useMart(biomart="fungi_mart", host="fungi.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
yeast_txdb <- makeTxDbFromBiomart(biomart="fungi_mart",
                                  dataset="scerevisiae_eg_gene",
                                  host="fungi.ensembl.org")
yeast_txdb

## Note that the dataset for Yeast on Ensembl Fungi is not necessarily
## the same as on the main Ensembl database:
yeast_txdb0 <- makeTxDbFromBiomart(dataset="scerevisiae_gene_ensembl")
all(transcripts(yeast_txdb0) %in% transcripts(yeast_txdb))
all(transcripts(yeast_txdb) %in% transcripts(yeast_txdb0))

## -----------------------
## --- Ensembl Metazoa ---

## The metazoa mart is slow and at the same time it doesn't seem to
## support requests that take more than 1 min at the moment. So a call to
## biomaRt::getBM() will fail with a "Timeout was reached" error if the
## requested data takes more than 1 min to download. This unfortunately
## happens with the example below so we don't try to run it for now.

## Not run: 
mart <- useMart(biomart="metazoa_mart", host="metazoa.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
worm_txdb <- makeTxDbFromBiomart(biomart="metazoa_mart",
                                 dataset="celegans_eg_gene",
                                 host="metazoa.ensembl.org")
worm_txdb

## Note that even if the dataset for Worm on Ensembl Metazoa contains
## the same transcript as on the main Ensembl database, the transcript
## type might be annotated with slightly different terms (e.g. antisense
## vs antisense_RNA):
filter <- list(tx_name="Y71G12B.44")
transcripts(worm_txdb, filter=filter, columns=c("tx_name", "tx_type"))
transcripts(txdb1, filter=filter, columns=c("tx_name", "tx_type"))

## End(Not run)
## ----------------------
## --- Ensembl Plants ---

## Like the metazoa mart (see above), the plants mart is also slow and
## doesn't seem to support requests that take more than 1 min either.
## So we don't try to run the example below for now.

## Not run: 
mart <- useMart(biomart="plants_mart", host="plants.ensembl.org")
datasets <- listDatasets(mart)
datasets[ , 1:2]
athaliana_txdb <- makeTxDbFromBiomart(biomart="plants_mart",
                                      dataset="athaliana_eg_gene",
                                      host="plants.ensembl.org")
athaliana_txdb

## End(Not run)
## ------------------------
## --- Ensembl Protists ---

mart <- useMart(biomart="protists_mart", host="protists.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
tgondii_txdb <- makeTxDbFromBiomart(biomart="protists_mart",
                                    dataset="tgondii_eg_gene",
                                    host="protists.ensembl.org")
tgondii_txdb

## ---------------------------------------------------------------------
## C. USING AN Ensembl MIRROR
## ---------------------------------------------------------------------

## You can use the 'host' argument to access the "ENSEMBL_MART_ENSEMBL"
## BioMart database at a mirror (e.g. at uswest.ensembl.org). A gotcha
## when doing this is that the name of the database on the mirror might
## be different! We can check this with listMarts() from the biomaRt
## package:
listMarts(host="useast.ensembl.org")

## Therefore in addition to setting 'host' to "uswest.ensembl.org" we
## might also need to specify the 'biomart' argument:
if (interactive()) {
  txdb3 <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                               dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids,
                               host="useast.ensembl.org")
  txdb3
}

## ---------------------------------------------------------------------
## D. USING FILTERS
## ---------------------------------------------------------------------

## We can use listFilters() from the biomaRt package to get valid filter
## names:
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl",
                host="www.ensembl.org")
head(listFilters(mart))

## Retrieve transcript dataset for Ensembl gene ENSG00000011198:
my_filter <- list(ensembl_gene_id="ENSG00000011198")

if (interactive()) {
  txdb4 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               filter=my_filter)
  txdb4
  transcripts(txdb4, columns=c("tx_id", "tx_name", "gene_id"))
  transcriptLengths(txdb4)
}

## ---------------------------------------------------------------------
## E. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------

chrominfo <- getChromInfoFromBiomart(dataset="celegans_gene_ensembl")
chrominfo

实例主要是通过线虫来演示，可能是因为这个生物的基因组小，数据量少的缘故。总的来讲还是不错的，后面还需要勤加练习。
这个包的可视化做的真的很棒例如用以下代码实现叠加韦恩图的效果：

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
files <- getSampleFiles()
print(files)

peak <- readPeakFile(files[[4]])
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")

plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
#BiocManager::install("ggupset")
#BiocManager::install("ggplot2")
#BiocManager::install("ggimage")
library(ggimage)
library(ggupset)
library(ggplot2)
upsetplot(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)

叠加韦恩图的效果

关于富集分析ChIPseeker 还提供一个函数, seq2gene, 可以完成连接基因区域到基因多对多的映射. 作者考虑到host gene (exon/intron), promoter region and flanking gene from intergenic region 这些区域可能通过cis-regulation来调控。这个函数的设计初衷应该是联系基因组上编码和非编码基因之间的功能预测。

因为生物研究领域富集分析运用广泛，这个包的作者Y叔也开发了一些Bioconductor packages来进行分析这浩如烟海的数据，例如可以探究是否筛选的基因与某个特殊的生物学过程是相关的分析工具：包括 DOSE(Yu et al. 2015) 做疾病富集分析 Disease Ontology, ReactomePA 做 reactome pathway, 还有鼎鼎大名的clusterProfiler(Yu et al. 2012) 做的 Gene Ontology 和 KEGG 富集分析.

下一个重点就是注释了
需要先下载ReactomePA包
这个包有个.db文件有点大，下载之前必须要配置好镜像。否则可能导致失败。
下载后我查看了一下结果如下

Last login: Thu Nov 14 07:45:10 on console
(base) Cheng-MBP:~ chelsea$ cd /var/folders/tm/q03dw2_n18v2v6nlsw81yhc80000gn/T//RtmpLJTJI5/downloaded_packages
(base) Cheng-MBP:downloaded_packages chelsea$ ls
ChIPseeker_1.22.0.tgz     ggplot2_3.2.1.tgz         magick_2.2.tgz
ReactomePA_1.30.0.tgz     ggupset_0.1.0.tgz         reactome.db_1.70.0.tar.gz
ggimage_0.2.4.tgz         graphite_1.32.0.tgz
(base) Cheng-MBP:downloaded_packages chelsea$ ll
total 964864
-rw-r--r--  1 chelsea  staff    3671144 Nov 14 08:00 ChIPseeker_1.22.0.tgz
-rw-r--r--  1 chelsea  staff    1419743 Nov 14 11:40 ReactomePA_1.30.0.tgz
-rw-r--r--  1 chelsea  staff     476247 Nov 14 11:07 ggimage_0.2.4.tgz
-rw-r--r--  1 chelsea  staff    3973186 Nov 14 10:59 ggplot2_3.2.1.tgz
-rw-r--r--  1 chelsea  staff    1416035 Nov 14 10:57 ggupset_0.1.0.tgz
-rw-r--r--  1 chelsea  staff     844840 Nov 14 11:38 graphite_1.32.0.tgz
-rw-r--r--  1 chelsea  staff   27211445 Nov 14 11:07 magick_2.2.tgz
-rw-r--r--  1 chelsea  staff  454979427 Nov 14 11:42 reactome.db_1.70.0.tar.gz

总共有400多兆，配置镜像相关可以参考https://mp.weixin.qq.com/s/nf2bzkBAkWp4W-KB5uBj2g

然后是注释结果可视化

#BiocManager::install("ReactomePA")
library(ReactomePA)

pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)

gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
dotplot(pathway2)

dot图可视化

然后是对peak进行比较分析

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))

比较结果

分面显示多个比较结果置信区间参数设置为conf=0.95

plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")

多个比较结果展示

还可用热图展示：

tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)

可能是画布的设置有点问题，没有能显示全面

除此之外还有对注释后的比较

有2个函数可以接受peaks注释后的列表信息(output of annotatePeak)：plotAnnoBar 和 plotDistToTSS .

peakAnnoList <- lapply(files, annotatePeak, TxDb=txdb,
                       tssRegion=c(-3000, 3000), verbose=FALSE)

接下来科研用plotAnnoBar函数来比较genomic annotation.

plotAnnoBar(peakAnnoList)

比较结果如下图所示：

`plotAnnoBar`函数比对结果

R 函数plotDistToTSS 可以以用来展示与TSS 的距离数据。也就是展示转录因子结合位点在TSS附近的分布情况。

plotDistToTSS(peakAnnoList)

plotDistToTSS函数可视化的结果

然后是功能分析：

genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster   = genes,
                           fun           = "enrichKEGG",
                           pvalueCutoff  = 0.05,
                           pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")

这个功能分析还做了比较

寻找peaks和注释的基因的overlap

genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)

peaks和注释的基因的overlap

作者认为寻找overlap非常重要

Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.

此外，找到重叠仅仅是流于表面的东西，更重要的是这个软件可以帮助计算显著性，这样得到的结果才可靠。

> p <- GRanges(seqnames=c("chr1", "chr3"),
+              ranges=IRanges(start=c(1, 100), end=c(50, 130)))
> shuffle(p, TxDb=txdb)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr1 227521923-227521972      *
  [2]     chr3     1960405-1960435      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> enrichPeakOverlap(queryPeak     = files[[5]],
+                   targetPeak    = unlist(files[1:4]),
+                   TxDb          = txdb,
+                   pAdjustMethod = "BH",
+                   nShuffle      = 50,
+                   chainFile     = NULL,
+                   verbose       = FALSE)
                                                      qSample                                            tSample qLen
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                    GSM1174480_ARmo_0M_peaks.bed.gz 1663
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                   GSM1174481_ARmo_1nM_peaks.bed.gz 1663
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663
           tLen N_OL     pvalue   p.adjust
ARmo_0M     812    0 0.88235294 0.88235294
ARmo_1nM   2296    8 0.17647059 0.35294118
ARmo_100nM 1359    3 0.33333333 0.44444444
CBX6_BF    1331  968 0.01960784 0.07843137

可以看到结果显示，CBX6_BF分析结果是有显著性的。
作者讲他们收集了目前GEO库中17,000 bed文件，可以通过getGEOspecies()函数得到他们的物种相关总结性的信息。

也可以通过以下函数下载bed文件相关信息

> downloadGEObedFiles(genome="hg19", destDir="hg19")
There were 50 or more warnings (use warnings() to see the first 50)
> gsm <- hg19$gsm[sample(nrow(hg19), 10)]
> downloadGSMbedFiles(gsm, destDir="hg19")
There were 36 warnings (use warnings() to see them)

至于那些警告是什么我也没有继续探究，提示什么非零，curl.rb的问题，我完全不知道在说啥。不过暂时也没时间去探究了。