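Juicer Tools: Introduction and Preprocessing
The Juicer analysis workflow and its main modules are shown in the figure below:
Juicer consists of three main modules: Juicer Tools, Juicebox, and Straw.
Juicer Tools: data analysis and feature annotation
Juicebox: Hi-C visualization
Straw: data extraction, i.e. reading contact data out of .hic files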
<u style="text-decoration: underline;">The basic file format of Juicer is the .hic file, a highly compressed binary file that stores the interaction (contact) data.</u>
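What can Juicer do?
Juicer can call A/B compartments, TADs, and loops, and it can also annotate loops and identify motifs. It is an all-in-one toolkit, as shown in the figure below:
So how is a .hic file generated?
We usually generate it with the pre module of juicer_tools. The input is a HiC-Pro validPairs file (note that the validPairs format needs a slight adjustment first; see hicpro2juicebox.sh).
Format of the adjusted validPairs input for pre: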
Usage:
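Required arguments: infile path, outfile path, genomesize
infile: a text file that stores the interaction data, in one of the formats below.
Note that the fields must be separated by spaces.
Format 1: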
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
Format 2:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
# For other formats, see https://github.com/theaidenlab/juicer/wiki/Pre#file-format
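The format adjustment mentioned above can be sketched in a few lines of Python. This is only an illustration, not the hicpro2juicebox.sh script itself. It assumes the usual HiC-Pro validPairs column order (read name, chr1, pos1, strand1, chr2, pos2, strand2, fragment size, frag1 name, frag2 name, mapq1, mapq2), fills the fragment fields with the dummy values 0 and 1, and orders the two ends by plain string comparison of the chromosome names. The function name `validpair_to_juicer` is made up for this example.

```python
def validpair_to_juicer(line):
    """Convert one HiC-Pro validPairs line into Format 1 for `pre`.

    Assumed input columns (HiC-Pro layout, an assumption of this sketch):
    readname chr1 pos1 strand1 chr2 pos2 strand2 size frag1 frag2 mapq1 mapq2
    """
    f = line.split()
    name = f[0]
    # Strand encoding from the text above: 0 = forward, anything else = reverse.
    end1 = ("0" if f[3] == "+" else "1", f[1], f[2])
    end2 = ("0" if f[6] == "+" else "1", f[4], f[5])
    mapq1, mapq2 = f[10], f[11]
    # Put the end with the smaller chromosome name first
    # (plain string comparison, an assumption of this sketch).
    if end1[1] > end2[1]:
        end1, end2 = end2, end1
        mapq1, mapq2 = mapq2, mapq1
    # Dummy fragment ids 0 and 1, since no restriction site file is used here.
    return " ".join([name, *end1, "0", *end2, "1", mapq1, mapq2])


example = "read1\tchr2\t100\t+\tchr1\t500\t-\t250\tHIC_chr2_1\tHIC_chr1_3\t42\t30"
print(validpair_to_juicer(example))
```

Each output line follows the `<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>` layout of Format 1.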
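outfile: path of the output file; note that the file name must end in .hic
genomesize: a two-column file listing chromosome names and chromosome sizes
A simple example: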
java -Xmx10g -jar juicebox_tools.jar pre chrsvalidpair_sam1.chr10.validpairs.gz sam1.chr10.hic chrom_mm9.sizes
chrsvalidpair_sam1.chr10.validpairs.gz :
chrom_mm9.sizes:
Two columns: chromosome name, chromosome size
chr1 197195432
chr2 181748087
chr3 159599783
chr4 155630120
chr5 152537259
chr6 149517037
chr7 152524553
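Because every chromosome in the pairs file must exist in the genome, it can help to check the two inputs against each other before running pre. The sketch below parses a two-column sizes file and reports chromosome names in Format 1 lines that are missing from it. Both function names are invented for this example and are not part of Juicer.

```python
def parse_chrom_sizes(text):
    """Parse the two-column chromosome sizes text into {name: size}."""
    sizes = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, size = line.split()[:2]
        sizes[name] = int(size)
    return sizes


def unknown_chromosomes(pair_lines, sizes):
    """Return sorted chromosome names used in Format 1 lines but absent from sizes."""
    missing = set()
    for line in pair_lines:
        f = line.split()
        # Format 1: readname str1 chr1 pos1 frag1 str2 chr2 pos2 frag2 mapq1 mapq2
        for chrom in (f[2], f[6]):
            if chrom not in sizes:
                missing.add(chrom)
    return sorted(missing)


sizes = parse_chrom_sizes("chr1 197195432\nchr2 181748087\n")
print(unknown_chromosomes(["r1 0 chr1 10 0 1 chrM 20 1 30 30"], sizes))
```

An empty list means every chromosome named in the pairs is covered by the sizes file.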
A more detailed invocation:
java -Djava.io.tmpdir=/tmp -Djava.awt.headless=true -Djava.library.path=juice/lib64 -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar pre chrsvalidpair_sam1.chr10.validpairs.gz sam1.chr10.hic chrom_mm9.sizes
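Optional arguments:
-d only compute intra-chromosomal interactions (default: false)
-f compute at restriction-fragment resolution; requires a restriction site file
-m &lt;int&gt; only output cells whose read count is greater than the threshold
-q &lt;int&gt; filter by MAPQ score; only output reads with a MAPQ score greater than or equal to q [not set]
-c &lt;chromosome ID&gt; only compute the given chromosome [not set]
-n do not normalize the matrices
…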
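When pre is run from a pipeline, the options above can be assembled programmatically. The following is a minimal sketch, not an official wrapper: the helper `build_pre_command` and its keyword names are made up here, the default jar name is a placeholder, and placing the options before the three positional arguments is an assumption. Only flags listed above are used.

```python
def build_pre_command(infile, outfile, genome, jar="juicer_tools.jar",
                      java_mem="10g", intra_only=False, min_count=None,
                      mapq=None, chrom=None, no_norm=False):
    """Assemble an argument list for `pre` (hypothetical helper)."""
    if not outfile.endswith(".hic"):
        raise ValueError("outfile must end in .hic")
    cmd = ["java", "-Xmx" + java_mem, "-jar", jar, "pre"]
    if intra_only:
        cmd.append("-d")                  # intra-chromosomal only
    if min_count is not None:
        cmd += ["-m", str(min_count)]     # read-count threshold
    if mapq is not None:
        cmd += ["-q", str(mapq)]          # MAPQ filter
    if chrom is not None:
        cmd += ["-c", str(chrom)]         # restrict to one chromosome
    if no_norm:
        cmd.append("-n")                  # skip normalization
    # Positional arguments last (ordering is an assumption of this sketch).
    cmd += [infile, outfile, genome]
    return cmd


print(" ".join(build_pre_command(
    "chrsvalidpair_sam1.chr10.validpairs.gz", "sam1.chr10.hic",
    "chrom_mm9.sizes", mapq=30, no_norm=True)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the jar path points to a real Juicer Tools jar.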
What if we chose not to normalize when running pre and generated the .hic file, but later want to normalize it after all?
We can use the addNorm module.
Basic usage:
java -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar addNorm sam1.chr10.hic -w 10000 -F
Parameters:
input_HiC_file: the input .hic file
-w: smallest resolution (bin size) at which genome-wide normalization is calculated
-F: do not normalize matrices at restriction-fragment resolution
-d: For genome-wide normalization, include intra-chromosomal matrices; by default, inter-only matrices are used.
Result: the contents of the .hic file are modified.
java -Djava.io.tmpdir=/tmp -Djava.awt.headless=true -Djava.library.path=juice/lib64 -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar addNorm sam1.chr10.hic -w 10000 -F
The core source code:
https://github.com/theaidenlab/Juicebox/tree/master/src/juicebox/tools
In addition, the normalization methods built into Juicer are described in detail below:
Normalization of Hi-C maps
To normalize the Hi-C maps, several methods are implemented.
Iterative Correction (IC) [1]: This method normalizes the raw contact map by removing biases introduced by the experimental procedure. It is a matrix balancing method; however, in the normalized matrix the row and column sums are not equal to one.
Knight-Ruiz Matrix Balancing (KR) [2]: The Knight-Ruiz (KR) algorithm is a fast method for balancing a symmetric matrix. The result is a doubly stochastic matrix, in which every row and every column sums to one.
Vanilla Coverage (VC) [3]: This method was first used for inter-chromosomal maps and was later applied to intra-chromosomal maps by Rao et al., 2014. It is a simple method in which each element is first divided by the sum of its row and then by the sum of its column.
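To make the descriptions above concrete, here is a small pure-Python sketch on a toy symmetric matrix. `vanilla_coverage` follows the VC description literally. `iterative_balance` is a simple symmetric scaling loop that drives every row and column sum toward one, which is the doubly stochastic target described for KR. It is neither the KR algorithm itself (which uses a faster solver) nor Juicer's implementation; both function names are invented for illustration.

```python
def vanilla_coverage(m):
    """Divide each cell by its row sum and then by its column sum."""
    rows = [sum(r) for r in m]
    cols = [sum(c) for c in zip(*m)]
    n = len(m)
    return [[m[i][j] / (rows[i] * cols[j]) if rows[i] and cols[j] else 0.0
             for j in range(n)] for i in range(n)]


def iterative_balance(m, iterations=200, tol=1e-10):
    """Scale a symmetric matrix until its non-empty rows sum to about one.

    Each pass divides cell (i, j) by sqrt(rowsum_i * rowsum_j), which keeps
    the matrix symmetric. This is a simplified sketch, not Juicer's code.
    """
    n = len(m)
    w = [list(row) for row in m]
    for _ in range(iterations):
        sums = [sum(r) for r in w]
        if max((abs(s - 1.0) for s in sums if s > 0), default=0.0) < tol:
            break
        scale = [s ** 0.5 if s > 0 else 1.0 for s in sums]
        w = [[w[i][j] / (scale[i] * scale[j]) for j in range(n)]
             for i in range(n)]
    return w


raw = [[10, 4, 1],
       [4, 8, 3],
       [1, 3, 6]]
balanced = iterative_balance(raw)
print([round(sum(row), 6) for row in balanced])
```

After balancing, the printed row sums should all be very close to one, whereas the VC result generally does not have equal row sums.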
Let's take a look at the effect of normalization:
References
[1] Imakaev et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods 9, 999–1003 (2012).
[2] Knight and Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis 33, 1029–1047 (2013).
[3] Lieberman-Aiden et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science (2009) 326 : 289-293.