计算d-distance以及Heterozygosity_Ratio的计算

我首先要介绍一下d-distance的计算方法

1.遗传距离规则：

核苷酸相同（AA、TT、CC、GG）：距离为 0。
核苷酸类型相同（AG、GA、TC、CT）：距离为 1。
核苷酸类型不同（AT、TA、GC、CG、AC、CA、GT、TG）：距离为 2。
缺失位点 ./. 不参与计算（跳过）。

2.VCF文件解析

REF（第4列）表示参考等位基因。
ALT（第5列）表示替代等位基因。
基因型的表示方式：
    0/0：两个等位基因均为 REF。
    1/1：两个等位基因均为 ALT。
    0/1 或 1/0：分别是一个 REF 和一个 ALT。

3.计算公式

image.png

n 为总共参与计算的位点数（不包括缺失位点）。
D_i 为第 i 个位点的遗传距离。
脚本为

import pysam

# 定义一个函数，用于计算两个等位基因的遗传距离
def calculate_distance(allele1, allele2):
    """
    根据等位基因对计算遗传距离。
    :param allele1: 第一个等位基因（如 'A'）
    :param allele2: 第二个等位基因（如 'T'）
    :return: 遗传距离（0, 1, 或 2）
    """
    # 核苷酸相同，遗传距离为 0
    if allele1 == allele2:
        return 0
    # 核苷酸类型（嘌呤或嘧啶）相同，遗传距离为 1
    elif (allele1 in ['A', 'G'] and allele2 in ['A', 'G']) or (allele1 in ['C', 'T'] and allele2 in ['C', 'T']):
        return 1
    # 核苷酸类型不同，遗传距离为 2
    else:
        return 2

# 打开压缩的VCF文件
vcf_file = "SCT_combine119.call.snp.filtered.2allelic.DP.vcf.gz"
vcf = pysam.VariantFile(vcf_file)

# 获取所有个体的ID
individuals = list(vcf.header.samples)

# 初始化一个字典，用于存储每个个体的遗传距离总和
distance_data = {ind: 0 for ind in individuals}

# 统计总位点数
total_sites = 0

# 遍历VCF文件中的每一行（每个位点）
for record in vcf:
    # 获取参考等位基因（REF）和替代等位基因（ALT）
    ref = record.ref  # 参考等位基因（如 'A'）
    alt = record.alts[0] if record.alts else None  # 替代等位基因（如 'T'）

    # 如果没有替代等位基因，跳过该位点
    if not alt:
        continue

    # 位点计数 +1
    total_sites += 1

    # 遍历每个个体的基因型
    for ind in individuals:
        genotype = record.samples[ind]["GT"]  # 获取基因型（如 (0, 1)）

        # 如果基因型缺失（如 './.'），跳过该个体在该位点的计算
        if None in genotype:
            continue

        # 根据基因型索引确定两个等位基因
        allele1 = ref if genotype[0] == 0 else alt  # 第一个等位基因
        allele2 = ref if genotype[1] == 0 else alt  # 第二个等位基因

        # 计算该位点的遗传距离
        distance = calculate_distance(allele1, allele2)

        # 更新该个体的总遗传距离
        distance_data[ind] += distance

# 计算每个个体的基因组平均遗传距离
results = {}
for ind, total_distance in distance_data.items():
    # 平均距离 = 总距离 / 总位点数
    results[ind] = total_distance / total_sites

# 输出结果到终端和文件
output_file = "d-distance_result.txt"
with open(output_file, "w") as f:
    # 写入文件标题
    f.write("个体ID\t平均遗传距离\n")
    print("个体ID\t平均遗传距离")  # 同时输出到终端
    for ind, avg_distance in results.items():
        # 写入每个个体的结果到文件
        f.write(f"{ind}\t{avg_distance}\n")
        print(f"{ind}\t{avg_distance}")  # 同时输出到终端

计算杂合子比率

het_radio=杂合位点数/非参考纯合位点数
也就是N(0/1) / N(1/1)

import pysam

# 打开压缩的VCF文件
vcf_file = "SCT_118_0.05MAF_clean_fixed.recode.vcf"
vcf = pysam.VariantFile(vcf_file)

# 获取所有个体的ID
individuals = list(vcf.header.samples)

# 初始化每个个体的计数器，分别统计杂合子数和非参考纯合子数
het_counts = {ind: 0 for ind in individuals}  # 统计 0/1 的数量
hom_alt_counts = {ind: 0 for ind in individuals}  # 统计 1/1 的数量

# 遍历VCF文件中的每一行（每个位点）
for record in vcf:
    # 遍历每个个体的基因型
    for ind in individuals:
        genotype = record.samples[ind]["GT"]  # 获取基因型（如 (0, 1)）

        # 如果基因型缺失（如 './.'），跳过该个体在该位点的计算
        if None in genotype:
            continue

        # 检查基因型并计数
        if genotype == (0, 1) or genotype == (1, 0):  # 杂合子
            het_counts[ind] += 1
        elif genotype == (1, 1):  # 非参考纯合子
            hom_alt_counts[ind] += 1

# 计算每个个体的杂合子比率
results = {}
for ind in individuals:
    het_count = het_counts[ind]
    hom_alt_count = hom_alt_counts[ind]

    # 避免除以 0 的情况，如果没有非参考纯合位点，则设置比率为 None
    if hom_alt_count > 0:
        results[ind] = het_count / hom_alt_count
    else:
        results[ind] = None  # 没有非参考纯合位点

# 保存结果到文件
output_file = "HET_ratio.txt"
with open(output_file, "w") as f:
    # 写入文件标题
    f.write("个体ID\t杂合子比率\n")
    for ind, het_ratio in results.items():
        # 如果比率为 None，输出为 "NA"
        het_ratio_str = f"{het_ratio:.4f}" if het_ratio is not None else "NA"
        f.write(f"{ind}\t{het_ratio_str}\n")

# 同时输出到终端以便查看
print("个体ID\t杂合子比率")
for ind, het_ratio in results.items():
    het_ratio_str = f"{het_ratio:.4f}" if het_ratio is not None else "NA"
    print(f"{ind}\t{het_ratio_str}")

计算d-distance以及Heterozygosity_Ratio的计算

我首先要介绍一下d-distance的计算方法

计算杂合子比率

推荐阅读更多精彩内容