Extract high quality variants from VCF files

Here shows a convenient way to extract the high quality variant sites form multiple VCF files using a simple Python script.
The script just screen the whole VCF file, select the high quality variations and generate a new VCF file that only include the high quality variations sites.

The low quality variations as follows were excluded:
  1. variant sites with sequencing depth "lower than min_depth" and "bigger than max_depth"
  2. heterozygous variant sites

Here's the python script:

# -*- coding:utf-8 -*-
import sys

min_depth = sys.argv[1]
max_depth = sys.argv[2]

for line in sys.stdin:
    if line.startswith("#"):
        print(line.strip())
    if line.startswith("Chr"):
        gt = line.split()[9].split(":")[0]
        gt_1 = gt.split("/")[0]
        gt_2 = gt.split("/")[1]
        dp = int(line.split()[9].split(":")[2])
        if gt_1 == gt_2 and dp >= int(min_depth) and dp <= int(max_depth):
            print(line.strip())

Usually the VCF file is in the format of "*.vcf.gz". Let's say your store all your VCF files in one folder, your just need to run the commond line as follows:

for sample in *.vcf.gz; do zcat $sample | python3 select_hq_vcf.py <your_min_depth> <your_max_depth> > sample.hq.vcf; done
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。