Here shows a convenient way to extract the high quality variant sites form multiple VCF files using a simple Python script.
The script just screen the whole VCF file, select the high quality variations and generate a new VCF file that only include the high quality variations sites.
The low quality variations as follows were excluded:
- variant sites with sequencing depth "lower than min_depth" and "bigger than max_depth"
- heterozygous variant sites
Here's the python script:
# -*- coding:utf-8 -*-
import sys
min_depth = sys.argv[1]
max_depth = sys.argv[2]
for line in sys.stdin:
if line.startswith("#"):
print(line.strip())
if line.startswith("Chr"):
gt = line.split()[9].split(":")[0]
gt_1 = gt.split("/")[0]
gt_2 = gt.split("/")[1]
dp = int(line.split()[9].split(":")[2])
if gt_1 == gt_2 and dp >= int(min_depth) and dp <= int(max_depth):
print(line.strip())
Usually the VCF file is in the format of "*.vcf.gz". Let's say your store all your VCF files in one folder, your just need to run the commond line as follows:
for sample in *.vcf.gz; do zcat $sample | python3 select_hq_vcf.py <your_min_depth> <your_max_depth> > sample.hq.vcf; done