Zero 全基因组比对
比对软件参考:https://github.com/carbonelab/lastz-pipeline/tree/main
准备文件:
target.fa target.2bit target.size
quey.fa qury.2bit qury.size
- 更改lastz-pipe.R的比对参数
1. 若物种分化时间很近,例如人和猩猩,使用默认distance="near"
2. 若物种分化时间超过100MYA,使用distance="far"
3. 其他使用distance="medium"
- 运行
Rscript lastz-pipe.R -t target.2bit -q qury.2bit -T target.size -Q qury.size -d /abo_path_to_output -p 32
输出为 target.qury.all.chain
One 单基因组AB区室统计
Pan-3D genome analysis reveals structural and functional diferentiation of soybean genomes —Figure 1
在这之前,生成每个物种的AB区室结果,接hic-pro的输出
run_call_AB.sh prefix bin_size annoate.gff prefix.ref.fasta prefix.ref.fasta.fai /path_to_01.cal-corr.py
其中:
prefix :物种名称
bin_size :分辨率
annoate.gff :注释文件
prefix.ref.fasta :参考基因组
prefix.ref.fasta.fai : 索引
/path_to_01.cal-corr.py: 指向脚本的绝对路径
eg.
run_call_AB.sh golani 100000 golani.chr.v4.gff Golani.Chr.v4.fasta Golani.Chr.v4.fasta.fai /path_to_01.cal-corr.py
最后结果为:prefix.combine.PC1.bed
输入:
物种的染色体位置及PC1的值,如图

微信截图_20241121213921.png
其中,正值是经过校正的PC1的值,代表A,反之代表B,空值为None
python3 01.py carmeli.PC1.bed > carmeli.AB.bed
output file format:
Chr1 0 50000 None
Chr1 50000 100000 B
Chr1 100000 150000 B
Chr1 150000 200000 A
Chr1 200000 250000 B
Chr1 250000 300000 None
...
python3 02.py 01.output species_name > prefix.AperEvChr.stat
eg. python3 02.py carmeli.AB.bed carmeli > carmeli.AperEvChr.stat
output file format: prefix.AperEvChr.stat
prefix chr_name A_count all_count A_percentage
carmeli Chr1 2777 4991 0.5564015227409337
carmeli Chr2 1824 4570 0.39912472647702407
carmeli Chr3 1424 3590 0.3966573816155989
carmeli Chr4 1392 3576 0.38926174496644295
...
python3 03.py 02.output perfix.GC.stat > prefix.AperEvChr_GC.stat
eg. python3 03.py golani.AperEvChr.stat golani.GC.stat > golani.AperEvChr_GC.stat
perfix.GC.stat:
Chr1 40.31
Chr2 40.19
Chr3 40.52
Chr4 40.34
...
it can be got by :
seqkit fx2tab prefix.ref.fa -g -n > prefix.GC.stat
output file format: prefix.AperEvChr_GC.stat
prefix chr_name GC A_per
golani Chr1 40.31 0.42178301093355763
golani Chr2 40.19 0.36876640419947504
golani Chr3 40.52 0.4832324173265021
python3 04.py prefix.1Mwin.bed prefix.1Mwin_TSS.output prefix.AperEvChr_GC.stat > prefix.AperEvChr_GC_TSS.stat
eg. python3 04.py carmeli.1Mwin.bed carmeli.1Mwin_TSS.output carmeli.AperEvChr_GC.stat > carmeli.AperEvChr_GC_TSS.stat
prefix.1Mwin.bed :
it can be got by :
bedtools makewindows -g carmeli.size -w 100000 > carmeli.1Mwin.bed
prefix.TSS.bed
it can be got by :
cat prefix.anno.gff | awk '$3=="gene"{print$0}' | awk '{print$1"\t"$4"\t"$4"\t"$9}' > prefix.TSS.bed
prefix.1Mwin_TSS.output
it can be got by :
bedtools intersect -a carmeli.1Mwin.bed -b carmeli.TSS.bed -wo > carmeli.1Mwin_TSS.output
output file format: prefix.AperEvChr_GC_TSS.stat
prefix chr_name GC A_per TSS_density(/Mb)
carmeli Chr1 40.79 0.5564015227409337 0.7439903846153846
carmeli Chr2 40.17 0.39912472647702407 0.7452954048140044
carmeli Chr3 43.12 0.3966573816155989 0.8044568245125349
...
python3 05.py prefix.size prefix.Chr_repeat.locat prefix.AperEvChr_GC_TSS.stat > prefix.AperEvChr_GC_TSS_repeat.stat
eg. python3 05.py carmeli.size carmeli.Chr_repeat.locat carmeli.AperEvChr_GC_TSS.stat > carmeli.AperEvChr_GC_TSS_repeat.stat
prefix.size
chr_id size
Chr1 249545545
Chr2 228452398
Chr3 179495747
it can be got by
its easy you know ? : )
prefix.Chr_repeat.locat
it can be got by :
cat prefix.repeat.chr.gff | cut -f 1,4,5 | bedtools merge -i - | grep "Chr" > carmeli.Chr_repeat.locat
output file format: prefix.AperEvChr_GC_TSS_repeat.stat
prefix chr GC A_per TSS_density(/Mb) repeat_density
carmeli Chr1 40.79 0.5564015227409337 0.7439903846153846 0.5028114727513969
carmeli Chr2 40.17 0.39912472647702407 0.7452954048140044 0.49054962863642165
carmeli Chr3 43.12 0.3966573816155989 0.8044568245125349 0.5070431835914195
...
python3 06.py prefix.AB.bed > prefix.AB_I.tmp
paste -d "\t" prefix.AB.bed prefix.AB_I.tmp > prefix.AB_I.bed
eg.
python3 06.py carmeli.AB.bed > carmeli.AB_I.tmp
paste -d "\t" carmeli.AB.bed carmeli.AB_I.tmp > carmeli.AB_I.bed
output file format: prefix.AB_I.bed
chr start end A/B I_stat
Chr1 0 50000 None I
Chr1 50000 100000 B I
Chr1 100000 150000 B I
Chr1 150000 200000 A I
...
python3 07.py prefix.AB_I.bed prefix > prefix.AB_I.stat
eg.
python3 07.py carmeli.AB_I.bed carmeli > carmeli.AB_I.stat
output file format: prefix.AB_I.stat
prefix chr I_percentage
carmeli Chr1 0.4898817872169906
carmeli Chr2 0.11575492341356675
carmeli Chr3 0.2908077994428969
carmeli Chr4 0.8671700223713646
...
python3 08.py prefix.AB_I.bed prefix.50kwin_TSS.output prefix.50kwin.GC.bed prefix.50kwin.repeat.bed > prefix.AB_I_GC_repeat_TSS.stat
eg.
python3 08.py carmeli.AB_I.bed carmeli.50kwin_TSS.output carmeli.50kwin.GC.bed carmeli.50kwin.repeat.bed > carmeli.AB_I_GC_repeat_TSS.stat
输入文件准备:
1. carmeli.50kwin_TSS.output:
bedtools intersect -a carmeli.50kwin.bed -b carmeli.TSS.bed -wo > carmeli.50kwin_TSS.output
2. carmeli.50kwin.GC.bed
bedtools nuc -fi Carmeli.Chr.v4.fasta -bed carmeli.50kwin.bed | cut -f 1-3,5 | grep-v "#" > carmeli.50kwin.GC.bed
3. carmeli.50kwin.repeat.bed
bedtools annotate -i carmeli.50kwin.bed -files carmeli.Chr_repeat.locat | bedtools sort -i - > carmeli.50kwin.repeat.bed
python3 09.py prefix.AB_I_GC_repeat_TSS.stat prefix > prefix.5type.stat
eg.
python3 09.py carmeli.AB_I_GC_repeat_TSS.stat carmeli > carmeli.5type.stat
Two 基于Pore-C的Pipeline
默认拥有.mcool文件,默认使用的分辨率为100000
1. mcool转为cool
hicConvertFormat -m prefix.mcool://resolutions/100000 -o prefix_100000.cool --inputFormat cool --outputFormat cool
2. mcool转为hic https://www.biostars.org/p/360254/
cooler dump --join prefix.mcool::/resolutions/100000 > prefix_bedpe
awk -F "\t" '{print 0, $1, $2, 0, 0, $4, $5, 1, $7}' prefix_bedpe > prefix_bedpe.short
sort -k2,2d -k6,6d prefix_bedpe.short > prefix_bedpe.short.sorted
java -jar juicer_tools.jar pre -r 100000 prefix_bedpe.short.sorted prefix.hic prefix.size
3. cool转为dense matrix
hictk dump prefix_100000.cool > prefix_100000.matrix
bedtools makewindows -g prefix.size -w 100000 -i winnum > prefix.100k.bed
{python sparseToDense.py -b prefix.100k.bed prefix_100000.matrix --perchr} 可以套用 run_call_AB.sh
Three 基于HiC-PRO的Pipeline
默认拥有HiC-PRO的结果 分辨率为100000
1. allValidPairs转hic
hicpro2juicebox.sh -i sample1.allValidPairs -g prefix.size -j juicer_tools_1.22.01.jar
Four 计算绝缘系数
h1d one IS prefix.cool 100000 Chrx -p 300000 -o output --datatype cool --gt prefix.size
Five 标准化过程
对h5格式表转化
hicCorrectMatrix correct -m prefix.h5 --outFileName prefix.ice.h5 --filterThreshold -1.2 5