其他拓展阅读
https://www.jianshu.com/p/27e91c8dac2f #ChIP-seq与CUT&Tag的介绍与对比
https://www.sohu.com/a/331749726_120054289 #CUT&Tag的详解
CUT&Tag for efficient epigenomic profiling of small samples and single cells #技术原文
https://www.abcam.cn/epigenetics/chromatin-profiling-guide-1 #ChIP-seq、CUT&RUN、CUT&Tag等的综合介绍
Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression #单细胞CUT&Tag技术文章
重复分析主要参考:
流程中出现K4me3写为K4m3的错误,请注意
流程如下:
01.数据的下载
根据流程,从SRA数据库下载相关数据:
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR12246717/SRR12246717.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR11074240/SRR11074240.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR11074254/SRR11074254.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR11074258/SRR11074258.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR11923224/SRR11923224.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8754611/SRR8754611.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8754612/SRR8754612.1
for i in `ls *.1`;do `fastq-dump --split-3 $i`;done
#重命名数据
mv SRR11074240.1_1.fastq K27me3_rep2_1.fastq
mv SRR11074240.1_2.fastq K27me3_rep2_2.fastq
mv SRR11074258.1_1.fastq K4me3_rep2_1.fastq
mv SRR11074258.1_2.fastq K4me3_rep2_2.fastq
mv SRR11074254.1_1.fastq K4me3_rep1_1.fasyq
mv SRR11074254.1_2.fastq K4me3_rep1_2.fasyq
mv SRR11923224.1_1.fastq IgG_rep1_1.fastq
mv SRR11923224.1_2.fastq IgG_rep1_2.fastq
mv SRR12246717.1_1.fastq K27me3_rep1_1.fastq
mv SRR12246717.1_2.fastq K27me3_rep1_2.fastq
cat SRR8754611.1_1.fastq SRR8754612.1_1.fastq >IgG_rep2_1.fastq
cat SRR8754611.1_2.fastq SRR8754612.1_2.fastq>IgG_rep2_2.fastq
样品名称与数据名称的对应关系为:
02.数据的预处理
01.使用fastqc进行质量的检查
所有样本的数据都符合要求:包括GC content、adapter含量等,所以不进行低质量reads的过滤、adapter的trim等。如果数据质量不符合,则进行reads的过滤,个人一般使用的是fastp(可以自动识别接头)。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/01.Data_Pre_processing/
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname $i); file_name=$(basename $i _1.fastq); echo "fastqc -t 10 -o ./ $dir_path/${file_name}_1.fastq">>fastqc.sh; echo "fastqc -t 10 -o ./ $dir_path/${file_name}_2.fastq">>fastqc.sh;done
ParaFly -c fastqc.sh -CPU 2
02.如果有技术重复,可以考虑进行数据的合并
如果有技术重复,可以在fastq层面直接进行合并。
03.序列的比对
包含Tn5接头和PCR条形码的插入序列结构如下:
01.比对至hg38基因组
该测序数据不需要进行数据的修剪:因为它使用的是25*25 PE的标准测序,因此接头序列不会被包含在数据里面。在更长的测序中,需要使用以下的参数:“--local --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700”进行比对,以规避可能存在的接头序列。或者提前进行修剪。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname $i); file_name=$(basename $i _1.fastq); echo "bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p 10 -x /public/home/jlwang_gibh/project/00.data/03.GRCh38/GRCh38.primary_assembly.genome -1 $dir_path/${file_name}_1.fastq -2 $dir_path/${file_name}_2.fastq -S $file_name.sam 2>$file_name.summary">>bowtie2.sh;done
nohup ParaFly -c bowtie2.sh -CPU 2 2>&1 &
02.比对至E.coli基因组
大肠杆菌的DNA携带着细菌产生的pA-Tn5蛋白,并在反应过程中被非特异性标记。比对到大肠杆菌基因组的reads数量取决于与目标蛋白结合的CUT&Tag数量,因此也取决于使用的细胞数量和染色质中该目标蛋白的丰度。由于在CUT&Tag反应中加入了固定数量的pATn5,并携带了固定数量的大肠杆菌DNA,因此在一系列实验中大肠杆菌reads可以被用来spike-in目标蛋白所提取信号的丰度。但这边的假设是,样本的细胞数和所加的pATn5需要一样/接近,而一般没有经过精心设计的实验,无法达到该假设,所以想用这种spike-in方法,进行矫正,反而可能是错误的。聚体是否使用此来进行spike-in需考虑实验过程产生的影响。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname $i); file_name=$(basename $i _1.fastq); echo "bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p 10 -x /public/home/jlwang_gibh/project/00.data/04.Ecoli_Genome/Ecoli.genome -1 $dir_path/${file_name}_1.fastq -2 $dir_path/${file_name}_2.fastq -S $file_name.ecoli.sam 2>$file_name.ecoli.summary">>bowtie2.ecoli.sh;done
nohup ParaFly -c bowtie2.ecoli.sh -CPU 2 2>&1 &
for i in `ls *sam`; do echo $i >>seqDepth; echo `samtools view -F 0x04 $i|wc -l `>>seqDepth;done
进行spike-in标准化时候,需额外使用--no-overlap和--no-dovetail两个额外的参数将序列比对到e.coli基因组上(--end-to-end --very-sensitive --no-overlap --no-dovetail --no-mixed --no-discordant --phred33 -I 10 -X 700
),去避免可能的实验的基因组序列交叉比对到用于 E. coli DNA校准的序列。
03.对比对结果进行统计
对测序深度、比对率、可比对reads的数量、重复序列比率、唯一序列的比率片段大小分布等进行统计
查看全局比对与唯一比对率,比对率高于80%的,认为是高质量的数据。因为CUT&Tag有着十分低的背景噪音,所以只需要超过1 Million的reads数量便可以获得较可靠的结果。而一些低丰度的转录因子和染色体蛋白,需要超过10倍于比对数量的reads用于下游分析。所以需要考虑CUT&Tag的所检测信号类型,不同的转录因子差异会十分大。
perl bowtie2_summary.pl
并手动将比对深度与其他比对统计指标合并
比对结果:
04.脚本
bowtie2_summary.pl
#!perl
@file= `ls *summary`;
open OUT,">All_bowtie2_summary";
print OUT "File Total reads aligned 0 times aligned exactly 1 time overall alignment rate\n";
foreach $file(@file){
#print $file,"\n";
open IN,"$file";
$file=~s/.summary\n//g;
while(<IN>){
chomp;
$_=~s/\r//g;
if(/reads; of these:/){
$_=~/(.*) reads; of these:/;
$total_reads{$file}=$1;
}
if(/aligned concordantly exactly 1 time/){
$_=~/\s+(.*) aligned concordantly exactly 1 time/;
$aligned_exactly_1_time{$file}=$1;
}
if(/ overall alignment rate/){
$_=~/(.*) overall alignment rate/;
$overall_alignment_rate{$file}=$1;
#print $overall_alignment_rate{$file},"\n";
}
}
close IN;
}
foreach(sort keys %overall_alignment_rate){
print OUT "$_\t$total_reads{$_}\t$aligned_exactly_1_time{$_}\t$overall_alignment_rate{$_}\n";
}
04.重复序列的去除
因为CUT&Tag的特性,将会整合接头到pA-Tn5结合抗体周围的DNA区域中,且整合的位点受到DNA可及性的影响。所以在此因素的影响下,获取的片段有着一致的起始与终止位点是可能的,而不是因为PCR扩增的原因。在高质量的文库中,不需要进行CUT&Tag重复序列的去除。而当文库初始浓度、总量很少或者PCR扩增次数很多的时候,重复序列需要进行去除。不过因为很多数据都不能像此demo的数据一样,质量十分好,所以一般推荐把重复序列去除。特别是一些检测转录因子结合信号的,一般重复序列的比例都很高,不大可能是由CUT&Tag的特性导致的,更可能的原因应该是光学重复和PCR重复。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/03.picard
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname $i); file_name=$(basename $i .sam);echo "picard SortSam I=$i O=$file_name.sorted.sam SORT_ORDER=coordinate">>$file_name.picard.sh; echo "picard MarkDuplicates I=$file_name.sorted.sam O=$file_name.sorted.dupMarked.sam METRICS_FILE=$file_name.picard.dupMark.txt">>$file_name.picard.sh; echo "picard MarkDuplicates I=$file_name.sorted.sam O=$file_name.sorted.rmDup.sam REMOVE_DUPLICATES=true METRICS_FILE=$file_name.picard.rmDup.txt" >>$file_name.picard.sh;done
rm *.ecoli.picard.sh
for i in `ls *sh`;do echo "bash $i ">>picard.sh;done
ParaFly -c picard.sh -CPU 10
perl picard_summary.pl
查看结果
01.脚本
picard_summary.pl
#!perl
@file= `ls *dupMark.txt`;
open OUT,">All_picard_summary";
print OUT "File name\tLibrary size\tDuplication ratios\n";
foreach $file(@file){
#print $file,"\n";
open IN,"$file";
$file=~s/.picard.dupMark.txt\n//g;
while(<IN>){
chomp;
$_=~s/\n//g;
if(/SECONDARY_OR_SUPPLEMENTARY_RDS/){
$value=<IN>;
$value=~s/\n//g;
@values=split /\t/,$value;
$ESTIMATED_LIBRARY_SIZE=$values[9];
$PERCENT_DUPLICATION=$values[8];
}
}
print OUT "$file\t$ESTIMATED_LIBRARY_SIZE\t$PERCENT_DUPLICATION\n";
}
05.评估比对片段大小的分布
因为CUT&Tag的特性,其产生的片段长度应该在一个核小体覆盖DNA的长度(180bp)或者多个核小体覆盖DNA的长度,所以在分布图上,会在180×n位置处出现鼓包。可以以此来评估数据质量的好坏。
首先在linux中统计比对上序列的长度
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/04.assessing_length_of_fragments
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname $i); file_name=$(basename $i .sam); samtools view -F 0x04 $i|awk -F '\t' 'function abs(x){return ((x < 0.0) ? -x : x)}{print abs($9)}' | sort | uniq -c | awk -v OFS="\t" '{print $2, $1/2}'>$file_name.fragmentLen.txt;done
rm -rf *.ecoli.fragmentLen.txt
之后在R中生成用于绘图的表格数据
sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")
histList = c("K27me3","K4me3","IgG")
All<-data.frame(matrix(ncol=6,nrow=0))
colnames(All)<-c("V1","V2","Weight","Histone","Replicate","sampleInfo")
for(hist in sampleList){
histInfo = strsplit(hist, "_")[[1]]
AA = read.table(paste0(hist,".fragmentLen.txt"), header = FALSE)
for(i in colnames(AA)){
AA[,i]<-as.numeric(AA[,i])
}
BB<-mutate(AA, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist)
All=rbind(All,BB)
}
write.table(All,"length_of_fragments.txt",sep="\t")
library("ggpubr")
library("ggplot2")
library("viridis")
setwd("C:/Users/Kong_lab_user/Desktop")
sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")
histList = c("K27me3","K4me3","IgG")
length_of_fragments<-read.table("length_of_fragments.txt",sep="\t",header = T)
length_of_fragments$sampleInfo = factor(length_of_fragments$sampleInfo, levels = sampleList)
length_of_fragments$Histone = factor(length_of_fragments$Histone, levels = histList)
## Generate the fragment size density plot (violin plot)
fig5A=ggplot(length_of_fragments,aes(x = sampleInfo, y = V1, weight = Weight, fill = Histone)) +
geom_violin(bw = 5) +
scale_y_continuous(breaks = seq(0, 800, 50)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ggpubr::rotate_x_text(angle = 20) +
ylab("Fragment Length") +
xlab("")
fig5B =ggplot(length_of_fragments,aes(x = V1, y = V2, color = Histone, group = sampleInfo, linetype = Replicate)) +
geom_line(size = 1) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
theme_bw(base_size = 20) +
xlab("Fragment Length") +
ylab("Count") +
coord_cartesian(xlim = c(0, 500))
ggarrange(fig5A, fig5B, ncol = 2)
06.比对结果的过滤与格式转换
01.根据比对分数进行过滤
有一些项目需要严格的对比对分数进行过滤,可以使用samtools进行过滤。这边使用的数据是未去除重复序列的。
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname $i); file_name=$(basename $i .sam);echo "grep ^@ $i>$file_name.qualityScore2.sam">>$file_name.filtering_based_on_score.sh;echo "samtools view -q 2 $i>>$file_name.qualityScore2.sam">>$file_name.filtering_based_on_score.sh;done
for i in `ls *filtering_based_on_score.sh`;do echo "bash $i">>filtering_based_on_score.sh;done
ParaFly -c filtering_based_on_score.sh -CPU 12
rm *.ecoli.qualityScore2.sam
02.格式转换
- 此处存在着一个error,即在使用bedtools bamtobed在 -bedpe 参数下,将bam转为bed时,需要先将bam依据名字排序,不然会出现warning:
WARNING: Query SRR3286802-24999 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
这时整个数据会因为出现这个warning而导致信息丢失(skipping掉很多reads,造成分析的不准确),以致于后续的分析出现错误。此处错误在本demo中未修改。
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/*sam`;do dir_path=$(dirname $i); file_name=$(basename $i .qualityScore2.sam);echo "samtools view -bS -F 0x04 $i >$file_name.qualityScore2.mapped.bam">>$file_name.conversion.sh;echo "bedtools bamtobed -i $file_name.qualityScore2.mapped.bam -bedpe >$file_name.qualityScore2.bed">>$file_name.conversion.sh;echo "awk '\$1==\$4 && \$6-\$2 < 1000 {print \$0}' $file_name.qualityScore2.bed >$file_name.qualityScore2.clean.bed">>$file_name.conversion.sh; echo "cut -f 1,2,6 $file_name.qualityScore2.clean.bed | sort -k1,1 -k2,2n -k3,3n >$file_name.qualityScore2.fragments.bed">>$file_name.conversion.sh;done
for i in `ls *conversion.sh`;do echo "bash $i">>conversion.sh;done
ParaFly -c conversion.sh -CPU 12
03.评估重复样品的重复性
以500bp的bin长度统计基因组上的mapping率
for i in `ls *qualityScore2.fragments.bed`;do dir_path=$(dirname $i); file_name=$(basename $i .bed); echo "awk -v w=500 '{print \$1, int((\$2 + \$3)/(2*w))*w + w/2}' $i | sort -k1,1V -k2,2n | uniq -c | awk -v OFS=\"\t\" '{print \$2, \$3, \$1}' | sort -k1,1V -k2,2n >$file_name.bin500.bed">>$file_name.replicate_reproducibility.sh;done
for i in `ls *replicate_reproducibility.sh`;do echo "bash $i">>replicate_reproducibility.sh;done
ParaFly -c replicate_reproducibility.sh -CPU 12
之后在R中进行数据的整合和相关性分析,将生成的bin500.bed下载到本地
library("ggpubr")
library("ggplot2")
library("viridis")
library("dplyr")
library("corrplot")
setwd("C:/Users/Kong_lab_user/Desktop")
#设置读取的sampels
sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")
#创建空的bed数据框
bed_data_frame=NULL
#循环读取数据
for(i in sampleList){
#设置文件名
bed_file_name<-paste(i,".qualityScore2.fragments.bin500.bed",collapse = "",sep="")
#读取文件
bed_file<-read.table(bed_file_name,header = F,sep="\t")
#设置读入数据框的列名
colnames(bed_file)<-c("chrom", "bin",i)
#整合不同文件的bed数据
if(length(bed_data_frame)==0){
bed_data_frame<-bed_file
}else{
bed_data_frame<-full_join(bed_data_frame, bed_file, by = c("chrom", "bin"))
}
}
#进行相关性分析
M = cor(bed_data_frame %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")
#绘制相关性图
corrplot(M, method = "color",
outline = T,
addgrid.col = "darkgray",
order="hclust", addrect = 3,
rect.col = "black",
rect.lwd = 3,
cl.pos = "b",
tl.col = "indianred4",
tl.cex = 1,
cl.cex = 1,
addCoef.col = "black",
number.digits = 2,
number.cex = 1,
col = colorRampPalette(c("midnightblue","white","darkred"))(100))
07.Spike-in校正
能进行校正的基本假设是:在一系列使用相同数量细胞的样本中,比对到E.coli基因组上reads的比例是相同的。由于这个假设,分析流程没有进行样品间和批次间的标准化,这会使得残余的E.coli DNA数量差异很大。可以定义一个S,然后去标准化比对深度。
S定义为
Normalized coverage的定义为
首先计算出每个样本的S值(scale_factor)
根据03.03中获得的表格,可以进行每个样本的S值,设定C为10000
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/06.spike_in
#生成bedtools所需的基因组信息文件
perl -e 'while(<>){if(/>/){chomp;$chr=$_;$chr=~/>(.*)\s/;$chr=$1}else{$sequence{$chr}.=$_}};open OUT,">genome_information.txt";foreach $chr(sort keys %sequence){$length=length($sequence{$chr});print OUT "$chr\t$length\n"}' GRCh38.primary_assembly.genome.fa
#获得校正后的bedgraph文件
perl -e 'while(<>){chomp;if(/E.coli/){@arr=split /\t/,$_;$file_name=$arr[1];$file_name=~s/.ecoli//g;$S=10000/$arr[5];print "bedtools genomecov -bg -scale $S -g /public/home/jlwang_gibh/project/00.data/03.GRCh38/genome_information.txt -i /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/${file_name}.qualityScore2.fragments.bed >${file_name}.qualityScore2.fragments.normalized.bedgraph\n"}}' summary.txt>>Spike_in.sh
ParaFly -c Spike_in.sh -CPU 10
08.鉴定Peak
01.使用SEACR进行peak的calling
因为不知道原文流程中对Control样本是如何选择的(有两个IgG重复样本),所以只能全部都用来做Control。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/07.Peak_calling/01.SEACR
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/06.spike_in/K*fragments.normalized.bedgraph`;do dir_path=$(dirname $i); file_name=$(basename $i .qualityScore2.fragments.normalized.bedgraph);echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh $i $dir_path/IgG_rep1.qualityScore2.fragments.normalized.bedgraph non stringent $file_name.IgG_rep1">>$file_name.SEACR.peakcalling.sh;echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh $i 0.01 non stringent $file_name.0.01">>$file_name.SEACR.peakcalling.sh;echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh $i $dir_path/IgG_rep2.qualityScore2.fragments.normalized.bedgraph non stringent $file_name.IgG_rep2">>$file_name.SEACR.peakcalling.sh;done
01.进行peaks数量的统计
for i in `ls *SEACR.peakcalling.sh`;do echo "bash $i">>SEACR.peakcalling.sh;done
ParaFly -c SEACR.peakcalling.sh -CPU 10
#对peak文件进行统计
perl -e '@file= `ls *stringent.bed`;foreach $file(@file){$file=~/(K.*)_(rep\d).(.*).stringent.bed/;$Histone=$1;$Replicate=$2;$Control=$3;$peak_number=`wc -l $file`;$peak_number=~/(\d+) /;$peak_number=$1;print "$Histone\t$Replicate\t$Control\t$peak_number\n"}'>SEACR_calling_summary.txt
统计结果:
02.统计峰的重复性
通过比对重复样品各自获得的峰,寻找重复样品之间都存在的峰,定义为可重复峰。其中进一步选择前1%的峰(由每个区块的peak强度)作为高置信度位点。
library(GenomicRanges)
setwd("C:/Users/Kong_lab_user/Desktop")
#设置样品名称
histL = c("K27me3", "K4me3")
repL = paste0("rep", 1:2)
peakType = c("0.01", "IgG_rep1","IgG_rep2")
peakOverlap_infor<-data.frame(matrix(ncol = 3,nrow = 0))
colnames(peakOverlap_infor)<-c("peakReprod", "Histone", "Control")
#读取数据并查看重复之间的overlap
for(type in peakType){
for(hist in histL){
overlap.gr = GRanges()
for(rep in repL){
peakInfo = read.table(paste0(hist,"_",rep,".",type,".stringent.bed"), header = FALSE, fill = TRUE)
#将起始数据转换为GRanges格式的位置信息数据
peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
if(length(overlap.gr) >0){
#查看不同rep的peak之间是否存在overlap
overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
}else{
overlap.gr = peakInfo.gr
}
}
#统计overlap peak之间的数量
peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, Control = type)
peakOverlap_infor<-rbind(peakOverlap_infor, peakOverlap)
}
}
#读取080101中peak数量统计信息,并与此步骤获取的overlap信息合并
peakN<-read.table("PeakN.txt",header = T,sep = "\t")
peakReprod = left_join(peakN, peakOverlap_infor, by = c("Histone", "Control"))
#计算重现率:`# peaks overlapping rep1 and rep2/# peaks of rep1 or rep2 * 100
peakReprod$peakReprodRate=peakReprod$peakReprod/peakReprod$Peak_number*100
write.table(peakReprod,file="peakReprod.txt",sep="\t")
03.FRagment proportion in Peaks regions (FRiPs)的计算
计算FRiPs用于衡量信噪比,并将其与IgG进行对比。
library(GenomicRanges)
library(dplyr)
library(chromVAR)
setwd("C:/Users/Kong_lab_user/Desktop")
#设置sample名称
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histL = c("K27me3", "K4m3")
repL = c("rep1", "rep2")
peakTypes = c("0.01", "IgG_rep1","IgG_rep2")
inPeakData = c()
#循环读取peak的bed文件和比对的bam文件,并计算出位于每个peak区域的reads数量
for(hist in histL){
for(rep in repL){
for(peakType in peakTypes){
#读取peak的bed文件
peakRes = read.table(paste0(hist, "_", rep, ".",peakType,".stringent.bed"), header = FALSE, fill = TRUE)
#转为GRanges格式,以便后续的分析
peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
#读取bam文件
bamFile = paste0(hist, "_", rep, ".qualityScore2.mapped.bam")
#获取位于peak上的reads数量
fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
inPeakN = counts(fragment_counts)[,1]
inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep,peakTypes=peakType))
}
}
}
#读取0303步骤中获得的比对情况表
mapping_info<-read.table("mapping_info.txt",header = T,sep="\t")
Histone<-c()
rep<-c()
#整合信息表
for(i in rownames(mapping_info)){
Histone<-c(Histone,strsplit(mapping_info[i,"Samples"],"_")[[1]][1])
rep<-c(rep,strsplit(mapping_info[i,"Samples"],"_")[[1]][2])
}
mapping_info$Histone<-Histone
mapping_info$Replicate<-rep
#提取所需的部分信息
mapping_info_hg38<-mapping_info[mapping_info$Genome%in%"hg38" & mapping_info$Histone %in% histL,]
mapping_info_hg38$Overall.alignment.rate<-as.numeric(gsub("%","",mapping_info_hg38$Overall.alignment.rate))
mapping_info_hg38$MappedFragNum_hg38<-mapping_info_hg38$Total.reads*mapping_info_hg38$Overall.alignment.rate/100
#将0303步骤获取的信息与peak上reads数量的信息整合
summary_inPeakData<-data.frame(matrix(ncol=4,nrow=0))
colnames(summary_inPeakData)<-c("Histone","Replicate","peakTypes","Sum of reads in peak")
for(i in unique(inPeakData$Histone)){
for(j in unique(inPeakData$Replicate)){
for(m in unique(inPeakData$peakTypes)){
rowname_tmp<-paste(i,j,m)
#获取样品中比对至peak区域的reads总数
sum1<-sum(inPeakData[inPeakData$Histone%in%i & inPeakData$Replicate %in% j & inPeakData$peakTypes%in% m,"inPeakN"])
summary_inPeakData[rowname_tmp,]=c(i,j,m,sum1)
}
}
}
#合并两者信息
summary_inPeakData<-left_join(summary_inPeakData,mapping_info_hg38,by=c("Histone", "Replicate"))
summary_inPeakData$`Sum of reads in peak`<-as.numeric(summary_inPeakData$`Sum of reads in peak`)
#计算FRiPs,分母为比对至hg38基因组上的序列数量
summary_inPeakData$FRiPs<-summary_inPeakData$`Sum of reads in peak`/summary_inPeakData$MappedFragNum_hg38
#输出summary_inPeakData
write.table(summary_inPeakData,file = "summary_inPeakData.txt",sep="\t")
结果:
09.可视化
01.使用IGV进行可视化
将由07步骤获得的bedgraph文件输入到IGV进行可视化,结果:
02.绘制heatmap进行可视化——转录单元
这边使用的比对文件为0601步骤获取的去除低质量比对的sam文件,去进行可视化。其中computeMatrix的-R参数文件由http://genome.ucsc.edu/cgi-bin/hgTables下载,仅仅取10000个基因作图(加快作图速度)。
cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/08.Visualization
#生成bw文件
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/*qualityScore2.sam`;do dir_path=$(dirname $i); file_name=$(basename $i .qualityScore2.sam); echo "samtools view -@ 16 -b -S $i -o $file_name.qualityScore2.bam" >>$file_name.heatmap_Visualization.sh; echo "samtools sort -o $file_name.qualityScore2.sorted.bam $file_name.qualityScore2.bam " >>$file_name.heatmap_Visualization.sh; echo "samtools index $file_name.qualityScore2.sorted.bam" >>$file_name.heatmap_Visualization.sh ; echo "bamCoverage -b $file_name.qualityScore2.sorted.bam -o $file_name.qualityScore2.bw ">>$file_name.heatmap_Visualization.sh;done
for i in `ls *heatmap_Visualization.sh`;do echo "bash $i">>heatmap_Visualization.sh;done
ParaFly -c heatmap_Visualization.sh -CPU 10
#进行图形的绘制
computeMatrix scale-regions -S IgG_rep1.qualityScore2.bw \
IgG_rep2.qualityScore2.bw \
K27me3_rep1.qualityScore2.bw \
K27me3_rep2.qualityScore2.bw \
K4m3_rep1.qualityScore2.bw \
K4m3_rep2.qualityScore2.bw \
-R test1 \
--beforeRegionStartLength 3000 \
--regionBodyLength 5000 \
--afterRegionStartLength 3000 \
--skipZeros -o matrix_gene.mat.gz -p 8
结果:
03.绘制heatmap进行可视化——Peak
#生成用于绘制图形的peak bed文件、bw文件,并进行图形的绘制
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/07.Peak_calling/01.SEACR/*stringent.bed`;do dir_path=$(dirname $i); file_name=$(basename $i .stringent.bed); file_name1=${file_name%%.*};echo "awk '{split(\$6, summit, \":\"); split(summit[2], region, \"-\"); print summit[1]\"\t\"region[1]\"\t\"region[2]}' $i>$file_name.summitRegion.bed" >>$file_name.heatmap_peak.sh; echo "computeMatrix reference-point -S /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/08.Visualization/01.heatmap/$file_name1.qualityScore2.bw -R $file_name.summitRegion.bed --skipZeros -o $file_name.SEACR.mat.gz -p 8 -a 3000 -b 3000 --referencePoint center">>$file_name.heatmap_peak.sh; echo "plotHeatmap -m $file_name.SEACR.mat.gz -out $file_name.png --sortUsing sum --startLabel \"Peak Start\" --endLabel \"Peak End\" --xAxisLabel \"\" --regionsLabel \"Peaks\" --samplesLabel \" $file_name1\" ">>$file_name.heatmap_peak.sh;done
for i in `ls *heatmap_peak.sh`;do echo "bash $i">>heatmap_peak.sh;done
ParaFly -c heatmap_peak.sh -CPU 1
K4me3 peak的图形:
K27me3 peak的图形:
10.差异分析
使用负二项分布模型进行差异分析,使用包为DESeq2。一般要进行差异分析的数据应该为同一类型的,但因为这边数据的限制,使用不同histone的数据进行演示。
library(GenomicRanges)
library(dplyr)
library(chromVAR)
library(DESeq2)
setwd("C:/Users/Kong_lab_user/Desktop")
#设置sample名称
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histL = c("K27me3", "K4m3")
repL = c("rep1", "rep2")
peakTypes = c("0.01", "IgG_rep1","IgG_rep2")
inPeakData = c()
#循环读取peak的bed文件,并整合进mPeak
for(hist in histL){
for(rep in repL){
for(peakType in peakTypes){
#读取peak的bed文件
peakRes = read.table(paste0(hist, "_", rep, ".",peakType,".stringent.bed"), header = FALSE, fill = TRUE)
#将所有peak储存与mpeak中
mPeak = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*") %>% append(mPeak, .)
}
}
}
#合并具有overlap的peak,保留主峰
masterPeak = reduce(mPeak)
#获取主峰列表中峰的比对数
countMat = matrix(NA, length(masterPeak), length(histL)*length(repL))
## overlap with bam file to get count
i = 1
for(hist in histL){
for(rep in repL){
bamFile = paste0( hist, "_", rep, ".qualityScore2.mapped.bam")
fragment_counts <- getCounts(bamFile, masterPeak, paired = TRUE, by_rg = FALSE, format = "bam")
countMat[, i] = counts(fragment_counts)[,1]
i = i + 1
}
}
colnames(countMat) = paste(rep(histL, 2), rep(repL, each = 2), sep = "_")
#进行测序深度的归一化和差异peak的检测
##删除low counts的峰
selectR = which(rowSums(countMat) > 5)
dataS = countMat[selectR,]
condition = factor(rep(histL, each = length(repL)))
##构建DDS对象
dds = DESeqDataSetFromMatrix(countData = dataS,
colData = DataFrame(condition),
design = ~ condition)
#进行差异分析
DDS = DESeq(dds)
#进行归一化
normDDS = counts(DDS, normalized = TRUE) ## normalization with respect to the sequencing depth
colnames(normDDS) = paste0(colnames(normDDS), "_norm")
res = results(DDS, independentFiltering = FALSE, altHypothesis = "greaterAbs")
#合并结果
countMatDiff = cbind(dataS, normDDS, res)
结果: