联合使用deeptools和自编R脚本分析链特异性数据

不积跬步无以至千里

deeptools作为分析深度测序数据的一大利器，受到了广泛的欢迎和关注，这一点从Github的星标数以及conda的安装次数就可以知道，很多人都对它非常熟悉了。

但是不知道你关注过没有，deeptools本身是一个处理链非特异性数据的工具，这一点在它的介绍里可见一斑：

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers.

但是我们有的时候确实需要用deeptools去处理链特异性的数据（例如新生RNA测序数据），我们该怎么办呢？

一个非常简单的情景就是我们要分析新生RNA测序数据在基因上的分布特征，这个时候我们显然不能将来自于负链基因的转录本统计到同一区域的正链基因上。一个简单的思路就是：

首先将BAM文件根据链拆分成正链和负链的BAM文件；
再用正链的BAM文件去和正链基因的BED文件去computeMatrix，负链同理；
最后再将正负链computeMatrix的结果进行合并。

示例

数据来源：GSE38140
数据类型：GRO-seq

首先下载数据并进行质控：

#-- data download
prefetch SRR828695
fastq-dump --split-3 SRR828695/SRR828695.sra
#-- FASTQC
mkdir fastqc
fastqc SRR828695/SRR828695_1.fastq -o fastqc/
#-- Quality Control
trim_galore --fastqc \
  --fastqc_args "-o fastqc" \
  --small_rna \
  --basename hct116

然后使用bowtie2进行比对：

#-- index
index="ref/bowtie2/hg38"
#-- mapping
bowtie2 --local -U hct116.fq -x $index | samtools view -q 20 -b -o hct116.bam
#-- remove duplicates (optional)
samtools markdup -r --output-fmt BAM hct116.bam hct116.flt.bam

判断一下链特异性：

infer_experiment.py -i hct116.flt.bam -r ref/genes.bed

This is SingleEnd Data
Fraction of reads failed to determine: 0.0881
Fraction of reads explained by "++,--": 0.7555
Fraction of reads explained by "+-,-+": 0.1565

属于stranded数据。
拆分正负链：

samtools view -F 16 -b -o hct116.p.bam hct116.flt.bam
samtools view -f 16 -b -o hct116.m.bam hct116.flt.bam
bamCoverage --bam hct116.p.bam --outFileName hct116.p.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM
bamCoverage --bam hct116.m.bam --outFileName hct116.m.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM

最后再分别computeMatrix：

computeMatrix scale-regions -R genes.p.bed \
  -S hct116.p.bw \
  -m 10000 \
  -a 5000 \
  -b 5000 \
  -p 3 \
  --binSize 100 \
  --skipZeros \
  --outFileName tmp.gz \
  --outFileNameMatrix p.txt

负链同理。
这里一定不要忘了--outFileNameMatrix，这个文件会作为我们后续的R输入文件。

合并结果

为了合并现有的profile结果，我写了一个R命令行工具供大家使用，且帮我debug，欢迎大家使用，源代码如下：

#!/usr/local/bin/Rscript

#--
#@Author: Kun-Ming Shui, School of Life Sciences, Nanjing University (NJU).
#@Contribution list: ...

suppressMessages(library(argparse))
suppressMessages(library(ggplot2))
suppressMessages(library(patchwork))
suppressMessages(library(pheatmap))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(purrr))
suppressMessages(library(forcats))

parser <- ArgumentParser(prog = 'deeptools2r.R',
             description = 'This tool can help you visualize deeptools computeMatrix output in R.',
             epilog = 'Kun-Ming Shui, skm@smail.nju.edu.cn')

parser$add_argument('--version', '-v', action = 'version', version = '%(prog)s 1.0.0')
parser$add_argument('--input', '-i', nargs = '+', help = 'the complexHeatmap output file, multiple files should be separated by spaced.', required = TRUE)
parser$add_argument('--output', '-o', help = 'the output file name, "deeptools2r.out.pdf" by default.', default = 'deeptool2r.out.pdf')
parser$add_argument('--averageType', '-t', help = 'the type of stastics should be used for the profile, "mean" by default.', default = 'mean', choices = c('mean', 'max', 'min', 'median', 'sum'))
parser$add_argument('--plotType', help = 'the plot type for profile, "line" by default.', default = 'line', choices = c('line', 'heatmap', 'both'))
parser$add_argument('--colors', nargs = '+', help = 'the colors used for plot lines, multiple colors should be separated by spaced and should be equal with group information size, "None" by default.', default = NULL, required = FALSE)
parser$add_argument('--group', '-g', nargs = '+', help = 'group information for INPUT FILE, an important function of this tool is to combine profile data from forward and reverse strand. For example, if you have the file list: r1.fwd.tab r1.rev.tab r2.tab, you should pass "-g r1_f r1_r r2" to this argument. All in all, profile data from one sample but different strand should be taged with same group but different strand.', required = TRUE)
parser$add_argument('--startLabel', help = '[Only for scale-regions mode] Label shown in the plot for the start of the region, "TSS" by default.', default = 'TSS', required = FALSE)
parser$add_argument('--endLabel', help = '[Only for scale-regions mode] Label shown in the plot for the end of the region, "TES" by default.', default = 'TES', required = FALSE)
parser$add_argument('--refPointLabel', help = '[Only for reference-point mode] Label shown in the plot for the center of the region', default = 'center', required = FALSE)
parser$add_argument('--yMax', help = 'Maximum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--yMin', help = 'Minimum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--width', help = 'Width value for line plot, 0.7 by default', type = 'double', default = 0.7, required = FALSE)
parser$add_argument('--plotHeight', help = 'Plot height in inch, 5 by default.', default = 5, type = 'double', required = FALSE)
parser$add_argument('--plotWidth', help = 'Plot width in inch, 7 by default.', default = 7, type = 'double', required = FALSE)

args <- parser$parse_args()

groups <- args$group
FILES <- args$input

#--group information
if(length(groups) != length(FILES)) stop('The group information does not equal with sample number.')
#-group level
gp.level <- sapply(groups, FUN = function(group){
    if(grepl(group, pattern = '_[f|r]$')){
        str_list <- strsplit(group, split = "_", fixed = T)
        return(paste(str_list[[1]][1:(length(str_list[[1]])-1)], collapse = "_"))
    }else{
        return(group)
    }
})
gp.level <- unique(gp.level)

#--load data
data <- lapply(FILES, FUN = function(FILE){
           tmp <- read.table(file = FILE, header = FALSE, sep = "\t", skip = 3)
           gp <- groups[FILES == FILE]
           gp.info <- ifelse(grepl(gp, pattern = '[r|f]$'), 
                 gp %>% 
                    strsplit(split = "_", fixed = TRUE) %>% 
                    sapply(FUN = function(string){string[1:length(string)-1]}) %>%
                    sapply(FUN = function(string){paste(string, collapse = "_")}),
                 gp)
           tmp %>% mutate(group = rep(gp.info, nrow(tmp)))
})
data <- purrr::reduce(data, rbind)

#--tidy data
data <- data %>% 
  group_by(group) %>% 
  summarise_all(args$averageType, na.rm = TRUE) %>% 
  pivot_longer(cols = starts_with('V'), names_to = 'index', values_to = 'signal')

#--label
label.info <- read.table(file = FILES[1], comment.char = "", nrows = 2, fill = TRUE)
#-downstream
dw.size <- label.info[2, ] %>% 
    grep(pattern = 'downstream', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-upstream
up.size <- label.info[2, ] %>% 
    grep(pattern = 'upstream', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-body
bd.size <- label.info[2, ] %>% 
    grep(pattern = 'body', value = T) %>% 
    gsub(pattern = '#', replacement = "") %>% 
    strsplit(split = ':', fixed = T) %>% 
    sapply('[[', 2) %>% 
    as.numeric()
#-bin
binSize <- label.info[2, ] %>%
    grep(pattern = 'size', value = T) %>%
    gsub(pattern = '#', replacement = "") %>%
    strsplit(split = ':', fixed = T) %>%
    sapply('[[', 2) %>%
    as.numeric()

#--plot
line_plot <- data %>%
    mutate(index = fct_relevel(index, paste0('V', 1:((up.size + bd.size + dw.size)/binSize))),
           group = fct_relevel(group, gp.level)) %>%
    ggplot(., aes(x = index, y = signal)) +
    geom_line(aes(group = group, color = group), linewidth = args$width) +
    xlab(label = 'Position') +
    ylab(label = 'Signal') +
    theme_classic() +
    theme(axis.text = element_text(family = 'sans', color = 'black'),
          axis.ticks = element_line(color = 'black'),
          axis.title = element_text(family = 'sans', face = 'bold'))

#-label
if(bd.size != 0){
    startLabel <- ifelse(is.null(args$startLabel), 'TSS', args$startLabel)
    endLabel <- ifelse(is.null(args$endLabel), 'TSS', args$endLabel)
    breakPoints <- c('V1', paste0('V', c(up.size/binSize, (up.size + bd.size)/binSize)), paste0('V', (up.size + bd.size + dw.size)/binSize))
    line_plot <- line_plot + 
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), startLabel, endLabel, paste0(dw.size/1000, ' kb')))
}else{
    refPointLabel <- ifelse(is.null(args$refPointLabel), 'center', args$refPointLabel)
    breakPoints <- c('V1', paste0('V', up.size/binSize), paste0('V', (up.size + dw.size)/binSize))
    line_plot <- line_plot +
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), refPointLabel, paste0(dw.size/1000, ' kb')))
}

#--Y-region
if(!is.null(args$yMin) && !is.null(args$yMax)){
    line_plot <- line_plot + 
        coord_cartesian(ylim = c(args$yMin, args$yMax))
}

#--colors
if(!is.null(args$colors) && length(args$colors) == length(gp.level)){
    line_plot <- line_plot + 
        scale_color_manual(values = args$colors)
}else if(!is.null(args$colors) && length(args$colors) != length(gp.level)){
    print("Warning: Your color number doesn't match group number, use default set instead!")
}

plot <- line_plot
ggsave(filename = args$output, plot = plot, width = args$plotWidth, height = args$plotHeight, units = 'in')

cat(paste0('Finished at ', date(), '!\n'))
q(save = 'no')

请大家先安装好依赖包，都很常见：

argparse
ggplot2
patchwork
pheatmap
tidyr
dplyr
purrr
forcats

另外，热图还在开发当中，目前还不支持，更新好后我会及时放在这里。
使用方法：

deeptools2r.R --help

usage: deeptools2r.R [-h] [--version] --input INPUT [INPUT ...]
                     [--output OUTPUT]
                     [--averageType {mean,max,min,median,sum}]
                     [--plotType {line,heatmap,both}]
                     [--colors COLORS [COLORS ...]] --group GROUP [GROUP ...]
                     [--startLabel STARTLABEL] [--endLabel ENDLABEL]
                     [--refPointLabel REFPOINTLABEL] [--yMax YMAX]
                     [--yMin YMIN] [--width WIDTH] [--plotHeight PLOTHEIGHT]
                     [--plotWidth PLOTWIDTH]

This tool can help you visualize deeptools computeMatrix output in R.

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  --input INPUT [INPUT ...], -i INPUT [INPUT ...]
                        the complexHeatmap output file, multiple files should
                        be separated by spaced.
  --output OUTPUT, -o OUTPUT
                        the output file name, "deeptools2r.out.pdf" by
                        default.
  --averageType {mean,max,min,median,sum}, -t {mean,max,min,median,sum}
                        the type of stastics should be used for the profile,
                        "mean" by default.
  --plotType {line,heatmap,both}
                        the plot type for profile, "line" by default.
  --colors COLORS [COLORS ...]
                        the colors used for plot lines, multiple colors should
                        be separated by spaced and should be equal with group
                        information size, "None" by default.
  --group GROUP [GROUP ...], -g GROUP [GROUP ...]
                        group information for INPUT FILE, an important
                        function of this tool is to combine profile data from
                        forward and reverse strand. For example, if you have
                        the file list: r1.fwd.tab r1.rev.tab r2.tab, you
                        should pass "-g r1_f r1_r r2" to this argument. All in
                        all, profile data from one sample but different strand
                        should be taged with same group but different strand.
  --startLabel STARTLABEL
                        [Only for scale-regions mode] Label shown in the plot
                        for the start of the region, "TSS" by default.
  --endLabel ENDLABEL   [Only for scale-regions mode] Label shown in the plot
                        for the end of the region, "TES" by default.
  --refPointLabel REFPOINTLABEL
                        [Only for reference-point mode] Label shown in the
                        plot for the center of the region
  --yMax YMAX           Maximum value for Y-axis, "None" by default.
  --yMin YMIN           Minimum value for Y-axis, "None" by default.
  --width WIDTH         Width value for line plot, 0.7 by default
  --plotHeight PLOTHEIGHT
                        Plot height in inch, 5 by default.
  --plotWidth PLOTWIDTH
                        Plot width in inch, 7 by default.

Kun-Ming Shui, skm@smail.nju.edu.cn

浅试一下这个工具：

deeptools2r.R --input m.txt p.txt \
    --output deeptools.pdf \
    --group hct116_r hct116_f \
    --plotHeight 2 \
    --plotWidth 3 \
    --colors '#045a8d'

这是符合经典的GRO-seq数据分布模式的~。

号外

我每次写东西时都不会吝啬于代码的分享，一方面想和大家共同进步，另一方面也是想大家帮我debug，欢迎大家试用这些小工具~

联合使用deeptools和自编R脚本分析链特异性数据