文件格式

可以是最基本的甲基化调用文件也可以是bismark下游的BAM/SAM文件。

在这里使用的是经过处理得到的文本文件。

典型的文件应该是这样的

##         chrBase   chr    base strand coverage freqC  freqT
## 1 chr21.9764539 chr21 9764539      R       12 25.00  75.00
## 2 chr21.9764513 chr21 9764513      R       12  0.00 100.00
## 3 chr21.9820622 chr21 9820622      F       13  0.00 100.00
## 4 chr21.9837545 chr21 9837545      F       11  0.00 100.00
## 5 chr21.9849022 chr21 9849022      F      124 72.58  27.42

导入文件

library(methylKit)
file.list <- list(
  "brain_1.txt",
  "brain_2.txt",
  "adrenal_1.txt",
 "adrenal_2.txt")

myobj <- methRead(
  file.list,
  sample.id=list(
    "brain_1",
    "brain_2",
    "adrenal_1",
    "adrenal_2"
    ),
  assembly="hg19",
  sep = " ",
  treatment=c(1,1,0,0),
  context="CpG",
  mincov = 10
)

1.样本描述性统计

1.1甲基化百分比的分布直方图

甲基化百分比

一般呈现两边高中间低的特点。要么高甲基化，要么低甲基化。

1.2甲基化覆盖率的直方图

甲基化覆盖率

如果数据受到了PCR扩增的影响，那么右边还会出现一个峰。

1.3根据覆盖率对位点进行筛选，去除过高或者过低的位点

2比较分析

2.1对样本进行merge

meth=unite(tile, destrand=FALSE)
head(tile)

得到在所有样本中都有cover的甲基化位点or甲基化区域（这里是甲基化区域）。

methylBase object with 6 rows
--------------
   chr  start    end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4
1 chr1 714501 715000      *       218      3    215       204      6    198       176      5    171       168      3
2 chr1 799001 799500      *        84     78      6       245    217     28        98     68     30       144    106
3 chr1 805001 805500      *       573     26    547      1005     22    983       823     16    807      1219     28
4 chr1 809001 809500      *        76     76      0        56     54      2        46     40      6        77     68
5 chr1 839001 839500      *       132     34     98       161     52    109       157    103     54       254    170
6 chr1 839501 840000      *       281     43    238       374     56    318       327     39    288       508     73
  numTs4
1    165
2     38
3   1191
4      9
5     84
6    435
--------------
sample.ids: brain_1 brain_2 adrenal_1 adrenal_2 
destranded FALSE 
assembly: hg19 
context: CpG 
treament: 1 1 0 0 
resolution: region

2.2样本相关性分析

得到相关系数图

样本相关系数图

2.3样本聚类信息

样本聚类图

2.4 PCA分析（是什么？）

We can also do a PCA analysis on our samples. The following function will plot a scree plot for importance of components.

碎石图

We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster.

image.png

2.5找到差异甲基化区域

The calculateDiffMeth() function is the main function to calculate differential methylation. Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method (Wang, Tuominen, and Tsai 2011). If you have replicates, the function will automatically use logistic regression.
calculateDiffMeth()函数是计算差异甲基化的主要函数。根据每个组的样本多少，它将使用Fisher精确检验或逻辑回归检验来计算p值。使用SLIM方法将p值调整为q值(Wang, Tuominen, and Tsai 2011)。如果有重复样本，函数将自动使用逻辑回归检验。

原理

这个原理我也不是很懂。

myDiff=calculateDiffMeth(meth)
# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
# get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
# get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)

可视化每条染色体的低甲基化/高甲基化碱基/区域的分布：

似乎报错了

为什么只有奇数的染色体呢

3.注释差异甲基化位点或者区域

得到差异甲基化的基因组分布

Summary of target set annotation with genic parts
Rows in target set: 12066
-----------------------
percentage of target features overlapping with annotation:
  promoter       exon     intron intergenic 
     14.63      26.44      60.27      31.86 

percentage of target features overlapping with annotation:
(with promoter > exon > intron precedence):
  promoter       exon     intron intergenic 
     14.63      16.91      36.61      31.86 

percentage of annotation boundaries with feature overlap:
promoter     exon   intron 
    3.91     1.36     3.16 

summary of distances to the nearest TSS:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    2890   11562   36476   36108 1718029

得到差异甲基化位点的CpG岛分布：

summary of target set annotation with feature annotation:
Rows in target set: 12066
----------------------------
percentage of target elements overlapping with features:
  CpGi shores  other 
 66.34   7.09  29.15 

percentage of feature elements overlapping with target:
  CpGi shores 
 15.25   1.96

得到距离TSS最近的基因名称

 target.row dist.to.feature feature.name feature.strand
57             1           -7315   uc001abu.1              +
92             2            3201   uc031pkn.1              +
64             3           -2155   uc031pko.1              +
64.1           4           -1655   uc031pko.1              +
106            5            1126   uc001ace.3              +
106.1          6            1626   uc001ace.3              +

读取CpG岛（CpGi）和CpG海岸注释时报错了，不过好像没有什么关系，帮助文档里也这么用。

...被弃用

a GenomicRangesList contatining one GRanges object for flanks and one for GRanges object for the main feature.
NOTE: This can not return a GRangesList at the moment because flanking regions do not have to have the same column name as the feature. GRangesList elements should resemble each other in the column content. We can not satisfy that criteria for the flanks

与内含子/外显子/启动子/CpG岛等重叠的差异甲基化区域的百分比/数目

image.png

读入甲基化调用文件之后会返回一个methylRawList对象。

但是当样本很多而且样本的规格很大的时候，可以返回不一样的对象
methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB统称为methylDB objects。

Sometimes, when dealing with multiple samples and increased sample sizes coming from genome wide bisulfite sequencing experiments, the memory of your computer might not be sufficient enough.

Therefore methylKit offers a new group of classes, that are basically pendants to the original methylKit classes with one important difference: The methylation information, which normally is internally stored as data.frame, is stored in an external bgzipped file and is indexed by tabix (Li 2011), to enable fast retrieval of records or regions. This group contains methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB, let us call them methylDB objects.

We can now create a methylRawListDB object, which stores the same content as myobj from above. But the single methylRaw objects retrieve their data from the tabix-file linked under dbpath.

methylkit差异甲基化分析

methylkit差异甲基化分析

文件格式

可以是最基本的甲基化调用文件也可以是bismark下游的BAM/SAM文件。

在这里使用的是经过处理得到的文本文件。

导入文件

1.样本描述性统计

1.1甲基化百分比的分布直方图

1.2甲基化覆盖率的直方图

1.3根据覆盖率对位点进行筛选，去除过高或者过低的位点

2比较分析

2.1对样本进行merge

2.2样本相关性分析

2.3样本聚类信息

2.4 PCA分析（是什么？）

2.5找到差异甲基化区域

3.注释差异甲基化位点或者区域

得到差异甲基化的基因组分布

得到差异甲基化位点的CpG岛分布：

得到距离TSS最近的基因名称

与内含子/外显子/启动子/CpG岛等重叠的差异甲基化区域的百分比/数目

推荐阅读更多精彩内容