官网给的参数其实大多也是默认参数,建议使用管道符或者for循环,事半功倍,不然运行时间太长了。
1.检测数据完整性
公司拿到的数据无论是cleandata还是rawdata,必须经质控才可以进行下边的操作。我拿到的cleandata存在接头未去除的问题,比对基因组只有50% map,使用--detect_adapter_for_pe 参数。
for i in `ls 00.CleanData/`;
do
printf "prepare $i\n\n"
fastp --detect_adapter_for_pe \
-i /mnt/g/linux/20231129cuttag/00.CleanData/"$i"/"$i"_1.clean.fq.gz -o /mnt/g/linux/20231129cuttag/fastp/"$i"_1.fastq.gz\
-I /mnt/g/linux/20231129cuttag/00.CleanData/"$i"/"$i"_2.clean.fq.gz -O /mnt/g/linux/20231129cuttag/fastp/"$i"_2.fastq.gz\
-h "$i".html
printf "finish $i";
done
2.对比基因组并转换成bam格式
使用bowtie2进行参考基因组索引构建及序列比对,并使用管道符将输出文件直接转换成bam格式,会节省很多时间。
CUT&Tag测序应该50 M reads比较好,测序深度不够,要保证覆盖度 12G???
for i in `ls *_1.fastq.gz`;
do
i=${i/_1.fastq.gz/};
printf "${i}"
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -x /mnt/g/linux/chip/gencode/genome \
-1 /mnt/g/linux/20231129cuttag/fastp/${i}_1.fastq.gz \
-2 /mnt/g/linux/20231129cuttag/fastp/${i}_2.fastq.gz | samtools sort -n -O bam -T ${i} > /mnt/g/linux/20231129cuttag/sam/"$i"_output_sorted.bam;
printf "finish ${i}";
done
一定要使用samtools sort -n,按read名排序把成对的排序到一起,这是后续代码成功运行的关键。
3. 提取paired reads中比对到参考基因组上的数据
根据FLAG进行过滤后(不可以省略),并没有根据MAPQ再次过滤,可以再次过滤选择。
$ samtools view -bF 12 ${i}_output_sorted.bam > ${i}.F12.bam
4. 根据MAPQ过滤+去重
#!/bin/bash
# This is for hisat2 batch aligne
for i in `ls *.F12.bam`;
do
i=${i/.F12.bam/};
printf "${i}"
samtools view -q 10 -O bam -o /mnt/g/linux/20231129cuttag/1219MAPQ/${i}_MAPQ.bam /mnt/g/linux/20231129cuttag/bam/${i}.F12.bam
printf "finish ${i}";
done
转录因子在进行序列对比后KO和WT细胞系mapped reads如果总体上未呈现趋势,可能是PCR重复。我们需要评估样本PCR重复,确定是否去重。
#!/bin/bash
# This is for bowtie2 batch aligne
for i in `ls *_MAPQ.bam`;
do
i=${i/_MAPQ.bam/};
printf "prepare $i"
./sambamba markdup -r /mnt/g/linux/20231129cuttag/1219MAPQ/${i}_MAPQ.bam /mnt/g/linux/20231129cuttag/1219MAPQ/${i}_dup.bam;
printf "finish $i";
done
4. 评估比对上的片段大小分布
CUT&Tag inserts adapters on either side of chromatin particles in the vicinity of the tethered enzyme, although tagmentation within chromatin particles can also occur. So, CUT&Tag reactions targeting a histone modification predominantly results in fragments that are nucleosomal lengths (~180 bp), or multiples of that length. CUT&Tag targeting transcription factors predominantly produce nucleosome-sized fragments and variable amounts of shorter fragments, from neighboring nucleosomes and the factor-bound site, respectively. Tagmentation of DNA on the surface of nucleosomes also occurs, and plotting fragment lengths with single-basepair resolution reveal a 10-bp sawtooth periodicity, which is typical of successful CUT&Tag experiments.
组蛋白的cut&tag会有核小体(200 bp)或者数倍于该长度的片段产生,但是转录因子跟结合位点结合的话就会有小片段。Tn5切割抗体结合位点,转录因子跟target序列结合后就会被切开,所以就会有不是核小体长度的小片段。笔者就是转录因子的测序结果,虽然没有官网提供的图片那么漂亮,但大致会有一个周期性片段分布的趋势,同时产生了一些小片段。
在150bp前出现第一个峰,这个峰值代表的就是切掉了开放的染色质区域,同时伴随着规律的锯齿状小峰,基本是10bp一个小锯齿。
在200bp左右处的峰值主要是由于切割核小体导致的,一个核小体缠绕的DNA大约在147bp,加上切割的时候不可能那么精确,所以大概在200bp。后续的峰值代表切下来两个、三个核小体以及n倍核小体情况,而且峰值是越来越低的,说明切割到染色质远端的核小体的概率是越来越低的。
测序偏好性??500bp以下测得序列准确
for i in `ls *.sortname.bam`;
do
i=${i/.sortname.bam/};
printf "prepare $i"
samtools view ${i}.sortname.bam | \
awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print $1"\t"abs($9)}' | \
sort | uniq | cut -f2 > /mnt/g/linux/20231129cuttag/bam/bed/fragmentLen/${i}_fragmentLen.txt
printf "finish $i";
done
即使不理解代码的含义,也可以直接运行,官网也采取的是默认参数。
再用R可视化试试
data <- read.table("G:/linux/20231129cuttag/bam/bed/fragmentLen/sample_fragmentLen.txt")
# 设置插入片段长度的阈值,过滤掉太长的片段
length_cutoff <- 1200
fragment <- data$V1[data$V1 <= length_cutoff]
fragment
# 利用直方图统计频数分布,设置柱子个数
breaks_num <- 500
res <- hist(fragment, breaks = breaks_num, plot = FALSE)
# 添加坐标原点
plot(x = c(0, res$breaks),
y = c(0, 0, res$counts) / 10^2,
type = "l", col = "red",
xlab = "Fragment length(bp)",
ylab = expression(Normalized ~ read ~ density ~ 10^2),
main = "Sample Fragment sizes")
官网的图长这个样子,组蛋白就是很漂亮...