5. 文件格式转换
-bedpe 要求按read名排序把成对的排序到一起,warning可能是因为没排序,并不一定是数据本身的问题。
for i in `ls *_sortname.bam`;
do
i=${i/_sortname.bam/};
printf "prepare $i"
## Convert into bed file format
bedtools bamtobed -i ${i}_sortname.bam -bedpe > /mnt/g/linux/20231129cuttag/bam/bed1/${i}_bowtie2.bed \
##Keep the read pairs that are on the same chromosome and fragment length less than 1000bp
awk '$1==$4 && $6-$2 < 1000 {print $0}' /mnt/g/linux/20231129cuttag/bam/bed1/${i}_bowtie2.bed > /mnt/g/linux/20231129cuttag/bam/bed1/${i}_bowtie2.clean.bed \
## Only extract the fragment related columns
cut -f 1,2,6 /mnt/g/linux/20231129cuttag/bam/bed1/${i}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n > /mnt/g/linux/20231129cuttag/bam/bed1/${i}_bowtie2.fragments.bed
printf "finish $i";
done
6. 样本重复性
$samtools index ${i}_sortname.bam
$multiBamSummary bins --bamfiles *_sortname.bam --labels sample1 sample2 sample3 -out readCounts.npz --outRawCounts readCounts.tab
$plotCorrelation -in readCounts.npz -o results_heatmap.png --corMethod spearman -p heatmap
转录因子看组间差异和重复性有点鸡肋,没有组蛋白那么大的差别。我第一次做CUT&Tag没有spike-in校准,后续代码也就有些许差别。
7. CUT&Tag分析需要bedgraph格式
##基因组染色体大小的文件利用samtools进行提取
samtools faidx GRCh38.primary_assembly.genome.fa
该文件包含关于索引的 FASTA 文件中的序列名称、长度和字节偏移的信息,相对于实际序列数据而言是相当紧凑的。
for i in `ls *_bowtie2.fragments.bed`;
do
i=${i/_bowtie2.fragments.bed/};
printf "prepare $i"
bedtools genomecov -bg -i ${i}_bowtie2.fragments.bed \
-g /mnt/g/linux/chip/gencode/GRCh38.primary_assembly.genome.fa.fai > /mnt/g/linux/20231129cuttag/bam/bed1/bedgraph/${i}.fragments.bedgraph
printf "finish $i";
done
seacr中包含对输入格式及转换的具体代码
Here is some example code for converting from a paired-end BAM to a fragment bedgraph file as described above:
bedtools bamtobed -bedpe -i $sample.bam > $sample.bed
awk '$1==$4 && $6-$2 < 1000 {print $0}' $sample.bed > $sample.clean.bed
cut -f 1,2,6 $sample.clean.bed | sort -k1,1 -k2,2n -k3,3n > $sample.fragments.bed
bedtools genomecov -bg -i $sample.fragments.bed -g my.genome > $sample.fragments.bedgraph
其中也包含对一些具体参数的描述,一定要看README.md。
## Description of input fields:
Field 1: Target data bedgraph file in UCSC bedgraph format (https://genome.ucsc.edu/goldenpath/help/bedgraph.html) that omits regions containing 0 signal.
Field 2: Control (IgG) data bedgraph file to generate an empirical threshold for peak calling. Alternatively, a numeric threshold *n* between 0 and 1 returns the top *n* fraction of peaks based on total signal within peaks.
Field 3: “norm” denotes normalization of control to target data, “non” skips this behavior. "norm" is recommended unless experimental and control data are already rigorously normalized to each other (e.g. via spike-in).
Field 4: “relaxed” uses a total signal threshold between the knee and peak of the total signal curve, and corresponds to the “relaxed” mode described in the text, whereas “stringent” uses the peak of the curve, and corresponds to “stringent” mode.
Field 5: Output prefix
没有spike-in,一定选norm.
seacr="/mnt/g/linux/SEACR-master/SEACR_1.3.sh"
for i in `ls *.fragments.bedgraph`;
do
i=${i/.fragments.bedgraph/};
printf "prepare $i"
#Calls enriched regions in target data using normalized IgG control track with stringent threshold
bash $seacr ${i}.fragments.bedgraph IgG.fragments.bedgraph norm stringent /mnt/g/linux/20231129cuttag/bam/bed1/bedgraph/${i}.seacr_control.peaks
printf "finish $i";
done
homer验证或者寻找motif
perl /mnt/e/linux/configureHomer.pl -list#基因组hg38有无,无需要下载
findMotifsGenome.pl /mnt/g/linux/20231129cuttag/bam/bed1/bedgraph/sample-1.seacr_co
ntrol.peaks.stringent.bed /mnt/g/linux/data/genomes/hg38 /mnt/g/linux/20231129cuttag/homer/sample-1_motif -len 8,10,
12
7-2. bam文件合并及motif查找
生物学重复样本可以先进行bam文件合并
samtools merge -o merge/KD_sortname.bam KD1_sortname.bam KD2_sortname.bam
macs2 callpeak -t WT_sortname.bam -c KD_sortname.bam -g hs -f BAMPE -n kd_wt --keep-dup=all --outdir callpeak
#BED格式,包含peak summits(peak最高点)位置;如果想寻找结合位点#的motifs ,建议使用此文件。
findMotifsGenome.pl WT_kd_summits.bed /mnt/e/linux/data/genomes/hg38 WT_kd_motif -len 8,10,12