实操一个WGS项目,从原始测序数据到比对,再到call变异,以及vcf文件的质控。
软件准备
- samtools
- sratoolskit
- bcftools
- htslib(bcftools源码下载后目录下会自带对应版本的htslib)
- bgzip/tabix(安装htslib后即可使用这两个工具)
- bwa
- GATK4(自带Picard)
实操
1. 参考基因组准备
# 下载
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
# 解压重命令
gzip -dc GCF_000005845.2_ASM584v2_genomic.fna.gz >E.coli_K12_MG1655.fa
# 简历索引
samtools faidx E.coli_K12_MG1655.fa
# samtools示例:快速查看基因组中特定序列位置的碱基
samtools faidx E.coli_K12_MG1655.fa NC_000913.3:1000000-1000200
2. 测序raw reads准备
# 下载数据,fastq-dump是sratoolskit中的工具
fastq-dump --split-files SRR1770413
#压缩数据,减少数据量
bgzip -f SRR1770413_1.fastq
bgzip -f SRR1770413_2.fastq
3. 比对
#建立索引
bwa index E.coli_K12_MG1655.fa
#1 比对
bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' ../input/E.coli_K12_MG1655.fa ../input/SRR1770413_1.fastq.gz ../input/SRR1770413_2.fastq.gz | samtools view -Sb - > ../output/E_coli_K12.bam && echo "** bwa mapping done **"
#2 排序
samtools sort -@ 4 -m 4G -O bam -o ../output/E_coli_K12.sorted.bam ../output/E_coli_K12.bam && echo "** BAM sort done"
rm -f ../output/E_coli_K12.bam
#3 标记PCR重复
gatk MarkDuplicates -I ../output/E_coli_K12.sorted.bam -O ../output/E_coli_K12.sorted.markdup.bam -M ../output/E_coli_K12.sorted.markdup_metrics.txt && echo "** markdup done **"
#4 删除不必要文件(可选)
rm -f ../output/E_coli_K12.bam
rm -f ../output//E_coli_K12.sorted.bam
#5 创建比对索引文件
samtools index ../output/E_coli_K12.sorted.markdup.bam && echo "** index done **"
4.变异检测
# 为参考基因组构建dict,前Picard的功能
gatk CreateSequenceDictionary -R E.coli_K12_MG1655.fa -O E.coli_K12_MG1655.dict && echo "** dict done **"
#1 生成中间文件gvcf
gatk HaplotypeCaller \
-R ../input/E.coli_K12_MG1655.fa \
--emit-ref-confidence GVCF \
-I ../output/E_coli_K12.sorted.markdup.bam \
-O ../output/E_coli_K12.g.vcf && echo "** gvcf done **"
#2 通过gvcf检测变异
gatk GenotypeGVCFs \
-R ../input/E.coli_K12_MG1655.fa \
-V ../output/E_coli_K12.g.vcf \
-O ../output/E_coli_K12.vcf && echo "** vcf done **"
5.vcf预处理
#1 压缩
bgzip -f ../output/E_coli_K12.vcf
#2 构建tabix索引
tabix -p vcf ../output/E_coli_K12.vcf.gz
6.变异过滤
# 使用SelectVariants,选出SNP
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk SelectVariants \
-select-type SNP \
-V ../output/E_coli_K12.vcf.gz \
-O ../output/E_coli_K12.snp.vcf.gz
# 为SNP作硬过滤
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk VariantFiltration \
-V ../output/E_coli_K12.snp.vcf.gz \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "Filter" \
-O ../output/E_coli_K12.snp.filter.vcf.gz
# 使用SelectVariants,选出Indel
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk SelectVariants \
-select-type INDEL \
-V ../output/E_coli_K12.vcf.gz \
-O ../output/E_coli_K12.indel.vcf.gz
# 为Indel作过滤
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk VariantFiltration \
-V ../output/E_coli_K12.indel.vcf.gz \
--filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "Filter" \
-O ../output/E_coli_K12.indel.filter.vcf.gz
# 重新合并过滤后的SNP和Indel
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk MergeVcfs \
-I ../output/E_coli_K12.snp.filter.vcf.gz \
-I ../output/E_coli_K12.indel.filter.vcf.gz \
-O ../output/E_coli_K12.filter.vcf.gz
# 删除无用中间文件
rm -f ../output/E_coli_K12.snp.vcf.gz* ../output/E_coli_K12.snp.filter.vcf.gz* ../output/E_coli_K12.indel.vcf.gz* ../output/E_coli_K12.indel.filter.vcf.gz*
结果
输出文件:
image.png
过滤后的vcf文件:
image.png
代码实现只是一个简单过程,其中每一步的原理、软件使用参数及结果的理解才是重点。参考:
GATK4.0和全基因组数据分析实践(上)
GATK4.0和全基因组数据分析实践(下)