了解过vcf文件的格式之后,对亲本纯合且差异的位点进行过滤就变得简单多了。如果格式规范的话一行awk命令其实就能解决。这里为了提高脚本的适用范围,所以写的稍微麻烦了些。
vcf=要过滤的vcf文件
p1=亲本一的样本名(与vcf中的样本名保持一致)
p2=亲本二的样本名(与vcf中的样本名保持一致)
python vcf_filt.py -i vcf -p1 p1 -p2 p2 >filter.vcf
脚本内容如下:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('-vcf', "--vcf",dest = "i", default="none", #metavar=", defining metavar is causing an error for some reason
help = "total vcf"
)
parser.add_argument('-p1', "--parent_1",dest = "P", default="P1", #metavar=", defining metavar is causing an error for some reason
help = "parent bulk 1 name"
)
parser.add_argument('-p2', "--parent_2",dest = "p",default="P2", #metavar=", defining metavar is causing an error for some reason
help = "parent bulk 2 name"
)
args = parser.parse_args()
vcffile=args.i
P1=args.P
P2=args.p
info_dic = {}
sample_dic = {}
with open(vcffile,'r')as vcf:
for line in vcf:
if line.strip() != '' and line[:2] != "##":
if line[:2] == "#C":
for i in line.strip().split("\t"):
sample_dic[i] = line.strip().split("\t").index(i)
print(line.strip())
else:
lst = line.strip().split("\t")
for i in lst[sample_dic["FORMAT"]].split(":"):
info_dic[i] = lst[sample_dic["FORMAT"]].split(":").index(i)
p1_gt = lst[sample_dic[P1]].split(":")[info_dic["GT"]][0] + lst[sample_dic[P1]].split(":")[info_dic["GT"]][2]
p2_gt = lst[sample_dic[P2]].split(":")[info_dic["GT"]][0] + lst[sample_dic[P2]].split(":")[info_dic["GT"]][2]
if (p1_gt == "00" and p2_gt == "11") or (p1_gt == "11" and p2_gt == "00"):
print(line.strip())
else:
print(line.strip())
整理不易,给个好评再走呗!