联合使用deeptools和自编R脚本分析链特异性数据

不积跬步无以至千里

deeptools作为分析深度测序数据的一大利器,受到了广泛的欢迎和关注,这一点从Github的星标数以及conda的安装次数就可以知道,很多人都对它非常熟悉了。

但是不知道你关注过没有,deeptools本身是一个处理链非特异性数据的工具,这一点在它的介绍里可见一斑:

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers.

但是我们有的时候确实需要用deeptools去处理链特异性的数据(例如新生RNA测序数据),我们该怎么办呢?

一个非常简单的情景就是我们要分析新生RNA测序数据在基因上的分布特征,这个时候我们显然不能将来自于负链基因的转录本统计到同一区域的正链基因上。一个简单的思路就是:

  • 首先将BAM文件根据链拆分成正链和负链的BAM文件;

  • 再用正链的BAM文件去和正链基因的BED文件去computeMatrix,负链同理;

  • 最后再将正负链computeMatrix的结果进行合并。

示例
  • 数据来源:GSE38140
  • 数据类型:GRO-seq

首先下载数据并进行质控:

#-- data download
prefetch SRR828695
fastq-dump --split-3 SRR828695/SRR828695.sra
#-- FASTQC
mkdir fastqc
fastqc SRR828695/SRR828695_1.fastq -o fastqc/
#-- Quality Control
trim_galore --fastqc \
  --fastqc_args "-o fastqc" \
  --small_rna \
  --basename hct116

然后使用bowtie2进行比对:

#-- index
index="ref/bowtie2/hg38"
#-- mapping
bowtie2 --local -U hct116.fq -x $index | samtools view -q 20 -b -o hct116.bam
#-- remove duplicates (optional)
samtools markdup -r --output-fmt BAM hct116.bam hct116.flt.bam

判断一下链特异性:

infer_experiment.py -i hct116.flt.bam -r ref/genes.bed
This is SingleEnd Data
Fraction of reads failed to determine: 0.0881
Fraction of reads explained by "++,--": 0.7555
Fraction of reads explained by "+-,-+": 0.1565

属于stranded数据。
拆分正负链:

samtools view -F 16 -b -o hct116.p.bam hct116.flt.bam
samtools view -f 16 -b -o hct116.m.bam hct116.flt.bam
bamCoverage --bam hct116.p.bam --outFileName hct116.p.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM
bamCoverage --bam hct116.m.bam --outFileName hct116.m.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM

最后再分别computeMatrix

computeMatrix scale-regions -R genes.p.bed \
  -S hct116.p.bw \
  -m 10000 \
  -a 5000 \
  -b 5000 \
  -p 3 \
  --binSize 100 \
  --skipZeros \
  --outFileName tmp.gz \
  --outFileNameMatrix p.txt

负链同理。
这里一定不要忘了--outFileNameMatrix,这个文件会作为我们后续的R输入文件。

合并结果

为了合并现有的profile结果,我写了一个R命令行工具供大家使用,且帮我debug,欢迎大家使用,源代码如下:

#!/usr/local/bin/Rscript

#--
#@Author: Kun-Ming Shui, School of Life Sciences, Nanjing University (NJU).
#@Contribution list: ...

suppressMessages(library(argparse))
suppressMessages(library(ggplot2))
suppressMessages(library(patchwork))
suppressMessages(library(pheatmap))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(purrr))
suppressMessages(library(forcats))

parser <- ArgumentParser(prog = 'deeptools2r.R',
             description = 'This tool can help you visualize deeptools computeMatrix output in R.',
             epilog = 'Kun-Ming Shui, skm@smail.nju.edu.cn')

parser$add_argument('--version', '-v', action = 'version', version = '%(prog)s 1.0.0')
parser$add_argument('--input', '-i', nargs = '+', help = 'the complexHeatmap output file, multiple files should be separated by spaced.', required = TRUE)
parser$add_argument('--output', '-o', help = 'the output file name, "deeptools2r.out.pdf" by default.', default = 'deeptool2r.out.pdf')
parser$add_argument('--averageType', '-t', help = 'the type of stastics should be used for the profile, "mean" by default.', default = 'mean', choices = c('mean', 'max', 'min', 'median', 'sum'))
parser$add_argument('--plotType', help = 'the plot type for profile, "line" by default.', default = 'line', choices = c('line', 'heatmap', 'both'))
parser$add_argument('--colors', nargs = '+', help = 'the colors used for plot lines, multiple colors should be separated by spaced and should be equal with group information size, "None" by default.', default = NULL, required = FALSE)
parser$add_argument('--group', '-g', nargs = '+', help = 'group information for INPUT FILE, an important function of this tool is to combine profile data from forward and reverse strand. For example, if you have the file list: r1.fwd.tab r1.rev.tab r2.tab, you should pass "-g r1_f r1_r r2" to this argument. All in all, profile data from one sample but different strand should be taged with same group but different strand.', required = TRUE)
parser$add_argument('--startLabel', help = '[Only for scale-regions mode] Label shown in the plot for the start of the region, "TSS" by default.', default = 'TSS', required = FALSE)
parser$add_argument('--endLabel', help = '[Only for scale-regions mode] Label shown in the plot for the end of the region, "TES" by default.', default = 'TES', required = FALSE)
parser$add_argument('--refPointLabel', help = '[Only for reference-point mode] Label shown in the plot for the center of the region', default = 'center', required = FALSE)
parser$add_argument('--yMax', help = 'Maximum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--yMin', help = 'Minimum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--width', help = 'Width value for line plot, 0.7 by default', type = 'double', default = 0.7, required = FALSE)
parser$add_argument('--plotHeight', help = 'Plot height in inch, 5 by default.', default = 5, type = 'double', required = FALSE)
parser$add_argument('--plotWidth', help = 'Plot width in inch, 7 by default.', default = 7, type = 'double', required = FALSE)

args <- parser$parse_args()

groups <- args$group
FILES <- args$input

#--group information
if(length(groups) != length(FILES)) stop('The group information does not equal with sample number.')
#-group level
gp.level <- sapply(groups, FUN = function(group){
    if(grepl(group, pattern = '_[f|r]$')){
        str_list <- strsplit(group, split = "_", fixed = T)
        return(paste(str_list[[1]][1:(length(str_list[[1]])-1)], collapse = "_"))
    }else{
        return(group)
    }
})
gp.level <- unique(gp.level)

#--load data
data <- lapply(FILES, FUN = function(FILE){
           tmp <- read.table(file = FILE, header = FALSE, sep = "\t", skip = 3)
           gp <- groups[FILES == FILE]
           gp.info <- ifelse(grepl(gp, pattern = '[r|f]$'), 
                 gp %>% 
                    strsplit(split = "_", fixed = TRUE) %>% 
                    sapply(FUN = function(string){string[1:length(string)-1]}) %>%
                    sapply(FUN = function(string){paste(string, collapse = "_")}),
                 gp)
           tmp %>% mutate(group = rep(gp.info, nrow(tmp)))
})
data <- purrr::reduce(data, rbind)

#--tidy data
data <- data %>% 
  group_by(group) %>% 
  summarise_all(args$averageType, na.rm = TRUE) %>% 
  pivot_longer(cols = starts_with('V'), names_to = 'index', values_to = 'signal')

#--label
label.info <- read.table(file = FILES[1], comment.char = "", nrows = 2, fill = TRUE)
#-downstream
dw.size <- label.info[2, ] %>% 
    grep(pattern = 'downstream', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-upstream
up.size <- label.info[2, ] %>% 
    grep(pattern = 'upstream', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-body
bd.size <- label.info[2, ] %>% 
    grep(pattern = 'body', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-bin
binSize <- label.info[2, ] %>%
    grep(pattern = 'size', value = T) %>%
    gsub(pattern = '#', replacement = "") %>%
    strsplit(split = ':', fixed = T) %>%
    sapply('[[', 2) %>%
    as.numeric()

#--plot
line_plot <- data %>%
    mutate(index = fct_relevel(index, paste0('V', 1:((up.size + bd.size + dw.size)/binSize))),
           group = fct_relevel(group, gp.level)) %>%
    ggplot(., aes(x = index, y = signal)) +
    geom_line(aes(group = group, color = group), linewidth = args$width) +
    xlab(label = 'Position') +
    ylab(label = 'Signal') +
    theme_classic() +
    theme(axis.text = element_text(family = 'sans', color = 'black'),
          axis.ticks = element_line(color = 'black'),
          axis.title = element_text(family = 'sans', face = 'bold'))

#-label
if(bd.size != 0){
    startLabel <- ifelse(is.null(args$startLabel), 'TSS', args$startLabel)
    endLabel <- ifelse(is.null(args$endLabel), 'TSS', args$endLabel)
    breakPoints <- c('V1', paste0('V', c(up.size/binSize, (up.size + bd.size)/binSize)), paste0('V', (up.size + bd.size + dw.size)/binSize))
    line_plot <- line_plot + 
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), startLabel, endLabel, paste0(dw.size/1000, ' kb')))
}else{
    refPointLabel <- ifelse(is.null(args$refPointLabel), 'center', args$refPointLabel)
    breakPoints <- c('V1', paste0('V', up.size/binSize), paste0('V', (up.size + dw.size)/binSize))
    line_plot <- line_plot +
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), refPointLabel, paste0(dw.size/1000, ' kb')))
}

#--Y-region
if(!is.null(args$yMin) && !is.null(args$yMax)){
    line_plot <- line_plot + 
        coord_cartesian(ylim = c(args$yMin, args$yMax))
}

#--colors
if(!is.null(args$colors) && length(args$colors) == length(gp.level)){
    line_plot <- line_plot + 
        scale_color_manual(values = args$colors)
}else if(!is.null(args$colors) && length(args$colors) != length(gp.level)){
    print("Warning: Your color number doesn't match group number, use default set instead!")
}

plot <- line_plot
ggsave(filename = args$output, plot = plot, width = args$plotWidth, height = args$plotHeight, units = 'in')

cat(paste0('Finished at ', date(), '!\n'))
q(save = 'no')

请大家先安装好依赖包,都很常见:

argparse
ggplot2
patchwork
pheatmap
tidyr
dplyr
purrr
forcats

另外,热图还在开发当中,目前还不支持,更新好后我会及时放在这里。
使用方法:

deeptools2r.R --help
usage: deeptools2r.R [-h] [--version] --input INPUT [INPUT ...]
                     [--output OUTPUT]
                     [--averageType {mean,max,min,median,sum}]
                     [--plotType {line,heatmap,both}]
                     [--colors COLORS [COLORS ...]] --group GROUP [GROUP ...]
                     [--startLabel STARTLABEL] [--endLabel ENDLABEL]
                     [--refPointLabel REFPOINTLABEL] [--yMax YMAX]
                     [--yMin YMIN] [--width WIDTH] [--plotHeight PLOTHEIGHT]
                     [--plotWidth PLOTWIDTH]

This tool can help you visualize deeptools computeMatrix output in R.

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  --input INPUT [INPUT ...], -i INPUT [INPUT ...]
                        the complexHeatmap output file, multiple files should
                        be separated by spaced.
  --output OUTPUT, -o OUTPUT
                        the output file name, "deeptools2r.out.pdf" by
                        default.
  --averageType {mean,max,min,median,sum}, -t {mean,max,min,median,sum}
                        the type of stastics should be used for the profile,
                        "mean" by default.
  --plotType {line,heatmap,both}
                        the plot type for profile, "line" by default.
  --colors COLORS [COLORS ...]
                        the colors used for plot lines, multiple colors should
                        be separated by spaced and should be equal with group
                        information size, "None" by default.
  --group GROUP [GROUP ...], -g GROUP [GROUP ...]
                        group information for INPUT FILE, an important
                        function of this tool is to combine profile data from
                        forward and reverse strand. For example, if you have
                        the file list: r1.fwd.tab r1.rev.tab r2.tab, you
                        should pass "-g r1_f r1_r r2" to this argument. All in
                        all, profile data from one sample but different strand
                        should be taged with same group but different strand.
  --startLabel STARTLABEL
                        [Only for scale-regions mode] Label shown in the plot
                        for the start of the region, "TSS" by default.
  --endLabel ENDLABEL   [Only for scale-regions mode] Label shown in the plot
                        for the end of the region, "TES" by default.
  --refPointLabel REFPOINTLABEL
                        [Only for reference-point mode] Label shown in the
                        plot for the center of the region
  --yMax YMAX           Maximum value for Y-axis, "None" by default.
  --yMin YMIN           Minimum value for Y-axis, "None" by default.
  --width WIDTH         Width value for line plot, 0.7 by default
  --plotHeight PLOTHEIGHT
                        Plot height in inch, 5 by default.
  --plotWidth PLOTWIDTH
                        Plot width in inch, 7 by default.

Kun-Ming Shui, skm@smail.nju.edu.cn
浅试一下这个工具:
deeptools2r.R --input m.txt p.txt \
    --output deeptools.pdf \
    --group hct116_r hct116_f \
    --plotHeight 2 \
    --plotWidth 3 \
    --colors '#045a8d'


这是符合经典的GRO-seq数据分布模式的~。

号外

我每次写东西时都不会吝啬于代码的分享,一方面想和大家共同进步,另一方面也是想大家帮我debug,欢迎大家试用这些小工具~

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,271评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,275评论 2 380
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,151评论 0 336
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,550评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,553评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,559评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,924评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,580评论 0 257
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,826评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,578评论 2 320
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,661评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,363评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,940评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,926评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,156评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,872评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,391评论 2 342

推荐阅读更多精彩内容