一、问题
在按照GATK流程做碱基质量重校正时,需要用到gatk-bundle中下载的一些已知位点vcf文件,如:
- resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
- resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf
- resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf
但是在运行下述脚本时,提示--known-sites
这里存在错误。
ref_genome_fasta_file=~/database/ref_genome/human/hg38/gatk-bundle/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
dbsnp=~/database/ref_genome/human/hg38/gatk-bundle/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
Mills_1000G_indel=~/database/ref_genome/human/hg38/gatk-bundle/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
assembly38_indel=~/database/ref_genome/human/hg38/gatk-bundle/resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz
#### BQSR
echo "Start ${i} BQSR for ${id}." `date`
$GATK --java-options "-Xmx10G -Djava.io.tmpdir=./${dir}/tmp" BaseRecalibrator \
-R ${GENOME} \
-I ./${dir}/${id}_markdup.bam \
--known-sites ${dbsnp} \
--known-sites ${Mills_1000G_indel} \
--known-sites ${assembly38_indel} \
-O ./${dir}/${id}_BaseRecal.table \
1>./${dir}/${id}_BaseRecal.log 2>&1
echo -e "BQSR for ${id} has finised at:" `date` "\n"
二、解决方法
参考该帖后,发现原因是因为gatk-bundle下载的vcf文件是二进制的,需要先用bcftools
处理一下就可以正确运行BQSR了。
bcftools view -O z -o tmp.vcf.gz resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf
mv tmp.vcf.gz resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
bcftools index -t resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
bcftools view -O z -o tmp.vcf.gz resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf
mv tmp.vcf.gz resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz
bcftools index -t resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz