上一期我们使用了Guitar包对Peak结果进行可视化,今天展示另一种可视化方法:使用metaPlotR包。
这个包将一些bash命令以及位置处理信息封装在了perl脚本中,然后使用R进行了可视化。下载地址:https://github.com/olarerin/metaPlotR。
下载下来后解压然后利用其中的脚本就可以了。
这一期如果linux基础不太好的不要轻易尝试这个方法,可以先稳固一下基础再来
一、数据准备Peak结果
Peak结果bed格式,为bed6,并且需要进行排序。
排序命令:
# 创建文件夹
mkdir metaPlotR
# 得到bed6并且排序
# -k1,1 表示只对第一列进行排序
# -k2,2n 表示只对第二列按照数字进行排序
# 先检查命令
ls */Mod.bed |awk -F'/' '{print"cat "$0 " | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/" $1 ".sorted.bed"}'
cat KO1/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/KO1.sorted.bed
cat KO2/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/KO2.sorted.bed
cat KO3/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/KO3.sorted.bed
cat WT1/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/WT1.sorted.bed
cat WT2/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/WT2.sorted.bed
cat WT3/Mod.bed | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/WT3.sorted.bed
# 执行
ls */Mod.bed |awk -F'/' '{print"cat "$0 " | cut -f 1-6 | sort -k1,1 -k2,2n >metaPlotR/" $1 ".sorted.bed"}' |sh
二、参考基因组注释文件
1)下载GRCm39 fa文件
前面我们使用的是ENSEMBL数据库的GRCm39的参考基因组,去这里下载对应UCSC的fa文件:http://hgdownload.soe.ucsc.edu/downloads.html
下载地址:http://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chromFa.tar.gz
下载完后解压到文件夹:chroms
2)制作GRCm39_annot.bed文件
首先去UCSC下载:the extended gene prediction tables
选择参数如下:
!!!多下载一次 两次下载大小完全一致即可保证下载完整了。
关于这个地方,我们之前用的ENSEMBL的gtf文件,这个页面用的GENCODE的,不必担心这两者的差异,看官网对于他们的注释异同:
What is the difference between GENCODE and Ensembl annotation?
The GENCODE annotation is made by merging the Havana manual gene annotation and the Ensembl automated gene annotation. The GENCODE annotation is the default gene annotation displayed in the Ensembl browser. The GENCODE releases coincide with the Ensembl releases, although we can skip an Ensembl release is there is no update to the annotation with respect to the previous release. In practical terms, the GENCODE annotation is identical to the Ensembl annotation.
简而言之,这两者就是一样的。
创建转录组中每个核苷酸的主注释文件
# 创建转录组中每个核苷酸的主注释文件
# chroms/为刚刚解压后的文件
perl make_annot_bed.pl --genomeDir chroms/ --genePred GRCm39_gencode.genePred > GRCm39_annot.bed
# 对基因bed文件进行排序
sort -k1,1 -k2,2n GRCm39_annot.bed > GRCm39_annot.sorted.bed
sed -i 's/chr//g' GRCm39_annot.sorted.bed
# 生成转录区域(即5'UTR、CDS和3'UTR)的起始和结束位点的转录组坐标的文件
# 它将排序后的主注释文件作为输入(GRCm39_annotation.sorted.bed),并输出一个区域注释文件。
# 区域注释文件是确定查询位点与转录组特征(即转录起始位点、起始密码子、终止密码子和转录结束)的距离所必需的。
# creates the transcript regions (i.e. 5’UTR, CDS and 3’UTR)
perl size_of_cds_utrs.pl --annot GRCm39_annot.sorted.bed > region_sizes.txt
对每个排序后的Peak的bed文件进行注释
# 注释
ls metaPlotR/*bed |perl -ne 'chomp;/metaPlotR\/(.*)/;print"perl annotate_bed_file.pl --bed $_ --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_$1\n";' >anno1.sh
nohup sh anno1.sh >anno1.log &
# qsub anno1.sh
# anno1.sh内容为下:
perl annotate_bed_file.pl --bed metaPlotR/KO1.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_KO1.sorted.bed
perl annotate_bed_file.pl --bed metaPlotR/KO2.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_KO2.sorted.bed
perl annotate_bed_file.pl --bed metaPlotR/KO3.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_KO3.sorted.bed
perl annotate_bed_file.pl --bed metaPlotR/WT1.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_WT1.sorted.bed
perl annotate_bed_file.pl --bed metaPlotR/WT2.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_WT2.sorted.bed
perl annotate_bed_file.pl --bed metaPlotR/WT3.sorted.bed --bed2 GRCm39_annot.sorted.bed >metaPlotR/annot_WT3.sorted.bed
#运行
# 识别用户提供的位点所在的转录本区域,并将转录组坐标转换为元基因坐标。
# 即,出现在5 ' utr中的位点的值从0到1,其中0和1分别代表5 ' utr的5 '和3 '末端。
# 类似地,CDS中的位点值从1到2,3 ' utr值从2到3。
# 脚本接受带注释的查询文件annot_miclip.cims作为输入。Bed和区域注释文件utr_cds_ends.txt。
# 输出的距离度量文件包含绘制元图所需的所有值
ls metaPlotR/anno*bed |perl -ne 'chomp;/annot_(.*).sorted/;print"perl rel_and_abs_dist_calc.pl --bed $_ --regions region_sizes.txt > metaPlotR/$1.m6a.dist.measures.txt\n";' >anno2.sh
nohup sh anno2.sh >anno2.log &
# qsub anno2.sh
# anno2.sh的内容为下:
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_KO1.sorted.bed --regions region_sizes.txt > metaPlotR/KO1.m6a.dist.measures.txt
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_KO2.sorted.bed --regions region_sizes.txt > metaPlotR/KO2.m6a.dist.measures.txt
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_KO3.sorted.bed --regions region_sizes.txt > metaPlotR/KO3.m6a.dist.measures.txt
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_WT1.sorted.bed --regions region_sizes.txt > metaPlotR/WT1.m6a.dist.measures.txt
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_WT2.sorted.bed --regions region_sizes.txt > metaPlotR/WT2.m6a.dist.measures.txt
perl rel_and_abs_dist_calc.pl --bed metaPlotR/annot_WT3.sorted.bed --regions region_sizes.txt > metaPlotR/WT3.m6a.dist.measures.txt
至此,每个样本的m6A Peak位置就已经转化为转录本上的坐标以及对应的Peak Number了:*.m6a.dist.measures.txt
tree metaPlotR
metaPlotR/
├── annot_KO1.sorted.bed
├── annot_KO2.sorted.bed
├── annot_KO3.sorted.bed
├── annot_WT1.sorted.bed
├── annot_WT2.sorted.bed
├── annot_WT3.sorted.bed
├── KO1.m6a.dist.measures.txt
├── KO1.sorted.bed
├── KO2.m6a.dist.measures.txt
├── KO2.sorted.bed
├── KO3.m6a.dist.measures.txt
├── KO3.sorted.bed
├── WT1.m6a.dist.measures.txt
├── WT1.sorted.bed
├── WT2.m6a.dist.measures.txt
├── WT2.sorted.bed
├── WT3.m6a.dist.measures.txt
└── WT3.sorted.bed
三、接下来是使用R语言进行可视化操作部分。
R传参脚本如下:metaPlot.R
rm(list=ls())
options(stringsAsFactors = F)
library(getopt)
spec <- matrix(c(
'help', 'h', 0, "logical",
'dist', 'd', 1, "character",
'name', 'n', 1, "character",
'od', 'o', 1, "character"),
byrow = TRUE, ncol = 4)
opt <- getopt(spec)
## usages
print_usage <- function(spec=NULL){
cat("
Usage example:
Rscript metaPlot.R --dist m6a.dist.measures.txt --name m6a --od ./
Options:
--help -h NULL get this help
--dist character m6a.dist.measures.txt [forced]
--name character sample name
--od character outdir [forced]
\n")
q(status=1)
}
## default paramter setting
if (!is.null(opt$help) |is.null(opt$dist) |is.null(opt$od)) { print_usage(spec) }
if (!file.exists(opt$od)) dir.create(opt$od)
# 加载包
library(ggplot2)
library(scales)
# 读取数据
m6a.dist <- read.delim (opt$dist, header = T)
dim(m6a.dist)
head(m6a.dist)
# Determine longest length transcript for each gene
trx_len <- m6a.dist$utr5_size + m6a.dist$cds_size + m6a.dist$utr3_size
temp <- data.frame(m6a.dist$gene_name, m6a.dist$refseqID, trx_len)
colnames(temp) <- c("gene_name", "gid", "trx_len")
temp.df <- temp[order(temp$gene_name, temp$gid, -temp$trx_len),]
temp.df <- temp[!duplicated(temp$gene_name),]
# limit m6a data to one transcript per gene (longest)
m6a.dist <- m6a.dist[m6a.dist$refseqID %in% temp.df$gid,]
# View size of our dataset (rows, columns)
dim(m6a.dist)
# re-scale the widths of the 5'UTR and 3'UTR relative to the CDS
utr5.SF <- median(m6a.dist$utr5_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)
utr3.SF <- median(m6a.dist$utr3_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)
# assign the regions to new dataframes
utr5.m6a.dist <- m6a.dist[m6a.dist$rel_location < 1, ]
cds.m6a.dist <- m6a.dist [m6a.dist$rel_location < 2 & m6a.dist$rel_location >= 1, ]
utr3.m6a.dist <- m6a.dist[m6a.dist$rel_location >= 2, ]
# rescale 5'UTR and 3'UTR
utr5.m6a.dist$rel_location <- rescale(utr5.m6a.dist$rel_location, to = c(1-utr5.SF, 1), from = c(0,1))
utr3.m6a.dist$rel_location <- rescale(utr3.m6a.dist$rel_location, to = c(2, 2+utr3.SF), from = c(2,3))
m6a.metagene.coord <- c(utr5.m6a.dist$rel_location, cds.m6a.dist$rel_location, utr3.m6a.dist$rel_location)
df <- data.frame(m6a.metagene.coord = m6a.metagene.coord)
##---------------- Visualizing the metagene
# A smooth density plot
setwd(opt$od)
p <- ggplot(df,aes(x =m6a.metagene.coord)) + geom_density(aes(colour="red")) + geom_vline(xintercept = 1:2, col = "grey") +
theme_bw() + xlim(0, 3) + theme(legend.position="none")
p
png(filename = paste0(opt$name,".png"),width = 800,height = 600,res=200)
print(p)
dev.off()
write.table(m6a.metagene.coord, file = paste0(opt$name,"_m6a.metagene.coord.txt"),row.names = F,sep = "\t",quote = F)
运行:
ls metaPlotR/*m6a.dist.measures.txt |perl -ne 'chomp;/metaPlotR\/(.*).m6a.dist.measures/;print"Rscript metaPlot.R --dist $_ --name $1 --od metaPlotR/\n";' >m6a_plot.sh
#sh脚本如下:
Rscript metaPlot.R --dist metaPlotR/KO1.m6a.dist.measures.txt --name KO1 --od metaPlotR/
Rscript metaPlot.R --dist metaPlotR/KO2.m6a.dist.measures.txt --name KO2 --od metaPlotR/
Rscript metaPlot.R --dist metaPlotR/KO3.m6a.dist.measures.txt --name KO3 --od metaPlotR/
Rscript metaPlot.R --dist metaPlotR/WT1.m6a.dist.measures.txt --name WT1 --od metaPlotR/
Rscript metaPlot.R --dist metaPlotR/WT2.m6a.dist.measures.txt --name WT2 --od metaPlotR/
Rscript metaPlot.R --dist metaPlotR/WT3.m6a.dist.measures.txt --name WT3 --od metaPlotR/
# 运行
nohup sh m6a_plot.sh >m6a_plot.sh.log &
#qsub m6a_plot.sh
可视化图中:0 to 1:表示5'UTR;1 to 2:表示CDS;2 to 3:表示3'UTR
贴上两个样本的结果如下:
也可以将多个样本绘制在一起:这里选取两个样本示例,用到上一步绘图的输出结果*_m6a.metagene.coord.txt
library(ggplot2)
## ---------------------------Comparing multiple metagene plots
# Read in metagene coordinates
KO1.metagene.coord <- t(read.delim ("KO1_m6a.metagene.coord.txt", header = T))
WT1.metagene.coord <- t(read.delim ("WT1_m6a.metagene.coord.txt", header = T))
metagene.cord <- c(KO1.metagene.coord, WT1.metagene.coord)
mod <- c(rep("KO1", length(KO1.metagene.coord)),
rep("WT1", length(WT1.metagene.coord)))
df <- data.frame(metagene.cord, mod)
ggplot(df) + geom_density(aes(x = metagene.cord, colour = mod)) + xlim(0, 3) +
theme_bw() + geom_vline(xintercept = 1:2, col = "grey")
可以看到KO样本与WT样本上m6A修饰水平的差异。
这个绘图结果是基于ggplot2的,大家可以自行对此图进行加工修饰,然后就可以达到发表文章级别的美图啦!
我们下期再见~