Relate 估计位点的起源时间

1、转换文件格式

从 VCF 文件转换为 relate 软件可以读取的 hap 格式：

PATH_TO_RELATE/bin/RelateFileFormats \

--mode ConvertFromVcf \

--haps example.haps \

--sample example.sample \

-i example

--haps 指定输出的.haps文件路径。

--sample 指定输出的.sample文件路径。

-i, --input 输入文件的基名（不带扩展名），并优先选择非压缩文件（如无压缩文件则使用压缩文件）。

--chr 可选参数：指定染色体索引（整数类型），将作为HAPS文件的第一列。默认值：0。

输出文件为 example.haps 和 example.sample。

# 这里的 VCF 文件需要进行 phase。

java -Xmx163840m -jar beagle.18May20.d20.jar \

gt=chr01.vcf \

out=chr01.phased \

window=8 #默认40，缩小可以提高计算速度，增加可以提高估计的准确性 \

ne=100000 #默认10e6 \

nthreads=8

2、估计谱系进化历史

首先需要进行一次简单的计算，后续所有计算都需要在此基础上运行，运算时间较短。

PATH_TO_RELATE/bin/Relate \

--mode All \

-m 1.25e-8 \

-N 30000 \

--haps data/example.haps \

--sample data/example.sample \

--map data/genetic_map.txt \

--annot data/example.annot \

--seed 1 \

-o example

输出文件分别为 example.anc 和 example.mut。

--mode 指定Relate的运行模式。模式"All"将执行算法的所有阶段，其他模式可用于单独执行特定阶段（参见文档）。此功能在并行化运行Relate时非常有用。

-m, --mutation_rate 每代每个碱基的突变率。

-N, --effective_size 单倍型的有效群体大小。（指单倍型而非个体！若需计算个体的有效群体大小，需将单倍型的有效群体大小乘以2）

example.haps 和 example.sample 为上一步骤生成的文件；

genetic_map.txt 则需要单独生成，具体格式如下：

将三列中的第i个元素分别记为p[i]、r[i]、rdist[i]，通过以下公式计算：r[i]=( (rdist[i+1]−rdist[i])/(p[i+1]−p[i]) )×1e6。

example.annot 是可选项，如果想放，按如下方式生成：

PATH_TO_RELATE/bin/RelateFileFormats \

--mode GenerateSNPAnnotations \

--haps example.haps \

--sample example.sample \

--poplabels example.poplabels \

--ancestor ancestor.fa \

-o example

example.poplabels 文件指定样本所在群体信息，按如下格式整理：

ancestor.fa 是可选项，如果要放，需要是和数据集比对后的 fasta 格式。

3、估计有效群体大小

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh \

-i example \

-m 1.25e-8 \

--poplabels data/example.poplabels \

--seed 1 \

-o example_popsize

较上一步骤，该步骤需要的时间较久。

这里 -i 的输入文件为上一步骤生成的 .anc 和 .mut 文件，无需后缀。

此外，有一些可选项可供选择：

--noanc 可选参数：指定是否在最后重新推断分支长度。0：重新推断，1：不重新推断。默认值：0。

--pop_of_interest 可选参数：指定需要估计种群大小的群体。使用.poplabels文件的第二列作为群体标签。可以指定多个群体，用逗号分隔（例如pop1,pop2）。使用此选项时，建议（但非必需）在运行Relate时添加--annot。如果未使用--annot运行Relate，也可后续补充（详见文档）。

--first_chr& --last_chr 可选参数：允许多个输入文件，假设文件名按染色体索引编号。示例：-i example --first_chr 1 --last_chr 2 会假设输入文件为 example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut。

--chr 可选参数：包含染色体名称的文件（每行一个名称）。此选项会覆盖--first_chr和--last_chr的设置。示例：-i example --chr chr.txt 会假设输入文件为example_chr*.anc,example_chr*.mut。

--threads 可选参数：最大线程数。默认值：1。

--bins 可选参数：指定分箱区间。格式为lower,upper,stepsize，对应函数c(0,10^seq(lower, upper, stepsize))。

--years_per_gen 可选参数：每代的年数。默认值：28.0。

--num_iter 可选参数：算法迭代次数。默认值：5。

--threshold 可选参数：[0,1]范围内的浮点数，表示按突变映射数量排序后需丢弃的树的比例。此参数有两个作用：1) 移除低质量/无信息量的树；2) 加速推断。值越大，丢弃的树越多。默认值：0.5。

--seed 可选参数：用于估计分支长度的MCMC算法中的随机数生成器种子。

输出文件：

example_popsize.pdf 展示估计的种群大小的图表。

example_popsize.anc.gz, example_popsize.mut.gz, example_popsize.dist

如果指定--noanc 1，则不会输出这些文件。包含更新分支长度后的谱系信息。无论threshold参数值如何，均会保留所有树的信息。若指定了--noanc 1，可通过独立运行 ReEstimateBranchLengths.sh 脚本重新估计分支长度。

example_popsize.coal 包含将所有样本视为单一群体时的共祖率（coalescence rates）和交叉共祖率（cross-coalescence rates）。

example_popsize.pairwise.coal 和 .bin 共祖率文件及其对应的二进制文件，记录样本对之间的共祖率。可通过 RelateCoalescentRates --mode FinalizePopulationSize 提取任意 .poplabels文件的共祖率。

example_popsize_avg.rate 包含时间平均的突变率。

4、重新估计分支长度

基于上述估计的有效群体大小重新估计分支长度。

PATH_TO_RELATE/scripts/SampleBranchLengths/ReEstimateBranchLengths.sh \

-i example \

-o example_sub \

-m 1.25e-8 \

--coal popsize.coal \

--first_bp 100000 \

--last_bp 200000 \

--seed 1

-i, --input 输入文件名（不包含扩展名），应为.anc和.mut文件的名称。需要包含输入haps/sample文件中的所有 SNP。如果未包含所有 SNP，请参见--dist选项。

-o, --output 输出文件的基名（不包含扩展名）。

-m, --mu 突变率，例如1.25e-8。

--coal 估计的共祖率，将所有单倍型视为一个群体。

--threads 可选参数：最大线程数。默认值为 1。

--first_bp 可选参数：起始碱基对位置（bp）。

--last_bp 可选参数：结束碱基对位置（bp）。

--dist 可选参数：包含 SNP 碱基对位置及其相邻 SNP 间距离的文件名（文件格式）。例如，可通过RelateExtract --mode AncMutForSubregion获取。注意：如果.anc/.mut文件仅覆盖输入haps/sample文件的子区域，则必须提供此文件。

--seed 可选参数：用于分支长度估计的随机种子。

输出文件为 example_sub.anc/mut.

5、估计突变率

计算随时间变化的突变率：

PATH_TO_RELATE/bin/RelateMutationRate \

--mode Avg\

-i example \

-o example

-i, --input 输入文件名（不包含扩展名），应为.anc和.mut文件的名称。

-o, --output 输出文件的基名（不包含扩展名）。

--first_chr & --last_chr 可选参数：允许多个输入文件，假设文件名按染色体索引编号。示例：-i example --first_chr 1 --last_chr 2会假设输入文件为example_chr1.anc,example_chr1.mut,example_chr2.anc,example_chr2.mut。

--chr 可选参数：包含染色体名称的文件（每行一个名称）。此选项会覆盖--first_chr和--last_chr的设置。示例：-i example --chr chr.txt会假设输入文件为example_chr*.anc,example_chr*.mut。

--bins 可选参数：指定分箱区间。格式为 lower, upper, stepsize，对应函数c(0,10^seq(lower, upper, stepsize))。

--years_per_gen 可选参数：每代的年数。默认值：28.0。

--dist 可选参数：指定.dist文件。该文件存储 SNP 之间的碱基对距离（单位：bp）。如果未指定，则使用.mut文件中提供的距离。仅在通过某些功能（如此函数）移除部分 SNP（及对应的树）时需要此选项。.dist文件可通过此功能生成。注意：如果与--first_chr或--last_chr同时指定，请提供不带扩展名的文件名。

输出文件为：example_avg.rate.

6、检测正选择

计算谱系选择证据的p值。建议先估计种群大小历史，并设置 --threshold 0。此操作会基于估计的种群大小历史更新所有树的分支长度。

PATH_TO_RELATE/scripts/DetectSelection/DetectSelection.sh \

-i example \

-o example_selection \

-m 1.25e-8 \

--first_bp 1000000 \

--last_bp 1000000 \

--years_per_gen 28 \

--coal popsize.coal

-i, --input 输入文件名（不包含扩展名），应为.anc和.mut文件的名称。需要包含haps/sample输入文件中的所有 SNP。

-o, --output 输出文件的基名（不包含扩展名）。

-m, --mu 突变率，例如1.25e-8。

--first_bp 可选参数：起始碱基对位置（bp）。

--last_bp 可选参数：结束碱基对位置（bp）。

--years_per_gen 可选参数：每代的年数。默认值：28.0。

--coal 可选参数：将所有单倍型视为一个群体时的共祖率估计。如果.anc/.mut文件是通过EstimatePopulationSize.sh脚本获得的，则不需要此参数；否则强烈建议提供。

--seed 可选参数：用于分支长度估计的随机种子。仅在指定--coal时使用。

输出文件

.freq 文件：记录衍生等位基因在genN到gen1世代（文件头中指定）的频率。

.lin 文件：记录树中谱系的数量，覆盖genN到gen1世代（文件头中指定），以及突变频率为2时的谱系数量。

.sele 文件：记录选择证据的 log10 p 值，覆盖genN到gen1世代（文件头中指定），以及突变频率为2时的 p 值。

获得选择的 P 值：

PATH_TO_RELATE/bin/RelateSelection \

--mode Frequency \

-i example \

-o example

输出文件为：example.freq, example.lin.

将上面的两个文件作为输入文件（example.freq, example.lin），计算每个SNP位点的P值：

PATH_TO_RELATE/bin/RelateSelection \

--mode Selection \

-i example \

-o example

输出文件为：example.sele。

7、推断样本支长

PATH_TO_RELATE/scripts/SampleBranchLengths/SampleBranchLengths.sh \

-i example \

-o example_sub \

-m 1.25e-8 \

--coal popsize.coal \

--format a \

--num_samples 5 \

--first_bp 100000 \

--last_bp 200000 \

--seed 1

-i, --input 输入文件名（不包含扩展名），应为.anc和.mut文件的名称。需要包含输入haps/sample文件中的所有 SNP。若未包含所有 SNP，请参考--dist选项。

-o, --output 输出文件的基名（不包含扩展名）。

-m, --mu 突变率，例如1.25e-8。

--coal 估计共祖率（coalescent rate），将所有单倍型视为一个群体。

--format 输出文件格式：'a': anc/mut 格式（默认）。'n': newick/sites 格式（供 CLUES 使用）。'b': 二进制格式（供 CLUES 和 PALM 使用）注意：anc/mut 格式会修改以支持存储同一分支的多个长度。

--num_samples 分支长度采样次数。需为整数且 ≥1。

--num_proposals 可选参数：采样间隔的 MCMC 提议次数。默认值为max(10*N, 1000)（N 为单倍型数量）。若设为0，所有样本将与输入一致。

--first_bp 可选参数：起始碱基对位置（bp）。

--last_bp 可选参数：结束碱基对位置（bp）。

--dist 可选参数：包含 SNP 的碱基对位置及相邻 SNP 间距的文件名（文件格式）。例如，通过RelateExtract --mode AncMutForSubregion生成。注意：若.anc/.mut文件仅覆盖输入haps/sample文件的子区域，则必须提供此文件。

--seed 可选参数：分支长度估计的随机种子。

输出文件为 example_sub.anc/mut 或 example_sub.newick/.sites 或 example_sub.timeb.

该步骤的输出文件可以用于后续的 CLUES 或 PALM 分析，我后面有时间会写一下如何利用 CLUES 推断位点的受选择时间。

详见：CLUES 推断位点的受选择时间（进化轨迹）

8、绘制树图

PATH_TO_RELATE/scripts/TreeView/TreeView.sh \

--haps data/example.haps \

--sample data/example.sample \

--anc example.anc \

--mut example.mut \

--poplabels data/example.poplabels \

--bp_of_interest 3000000 \

--years_per_gen 28 \

-o example

基于Relate计算的结果便可计算。

PATH_TO_RELATE/scripts/TreeView/TreeViewMutation.sh \

--haps data/example.haps

--sample data/example.sample \

--anc example.anc \

--mut example.mut \

--poplabels data/example.poplabels \

--bp_of_interest 3000000 \

--years_per_gen 28 \

-o example

在树图中，高亮显示携带衍生等位的样本。

PATH_TOcowplot/TreeView/TreeViewSamples.sh \

--haps data/example.haps

--sample data/example.sample \

--anc example.anc \

--mut example.mut \

--dist example.dist \

--poplabels data/example.poplabels \

--bp_of_interest 3000000 \

--years_per_gen 28 \

-o example

这个要用 SampleBranchLengths 的输出文件，在每个节点添加置信区间。