需要的数据文件:ref.fa test_1.fq test_2.fq
- 测序数据质控(具体方法另做总结),由原始测序文件raw_reads.fq得到可以进一步分析的clean_reads.fq(raw.fq--->clean.fq)
- 建索引
#index
bwa index ref.fa
bwa index -a bwtsw -p hg19 hg19.fa & #以人类为例,-p指定索引文件的开头名称
#生成5个文件:.amb,.ann,.bwt,.pac,.sa
samtools faidx ref.fa
gatk CreateSequenceDictionary -R ref.fa -O ref.dict
- 比对+排序+标记重复
这一步由test_1.fq、test_2.fq得到test.sorted.markup.bam
# bwa
bwa mem -t 4 -R '@RG\tID:test\tPL:illumina\tSM:test' ref.fa test_1.fq test_2.fq | samtools view -bS - >test.bam
samtools sort -@ 3 -o test.sorted.bam test.bam
rm test.bam
#markduplicate
Gatk MarkDuplicates -I test.sorted.bam -O test.sorted.markdup.bam -M test.sorted.markdup_metrics.txt3
#-I代表排序后的一个或者多个bam/sam文件,-M 代表输出重复矩阵
#对排序标记重复后的bam文件建立索引
samtools index test.sorted.markup.bam
此外,可收集比对信息、测序深度等
#Collect Alignment & Insert Size Metrics
gatk CollectAlignmentSummaryMetrics R=ref.fa I=test.sorted.markdup.bam O=alignment_metrics.txt
gatk CollectInsertSizeMetrics INPUT=test.sorted.markdup.bam OUTPUT=insert_metrics.txt HISTOGRAM_FILE=insert_size_histogram.pd
samtools depth -a test.sorted.markdup.bam > depth_out.txt
- 碱基质量重矫正
在开始检测变异前先进行碱基质量值重矫正,包含BaseRecalibrator和ApplyBQSR两步,第一步得到一个校准表文件(recal_data.table)
第二步利用recal_data.table重新调整原来BAM文件中的碱基质量值,生成一份新的BAM文件,变异检测就是利用这个新的BAM文件进行。
碱基质量重矫正需要利用一些已经存在的变异信息,所以不同物种的变异检测在碱基质量重矫正上存在差异。对应人类来说现有一些数据库,但其他物种可能没有。这一步由test.sorted.markup.bam得到recal_reads.bam
①对于有相应变异数据库的物种(比如人)来说
#Base Quality Score Recalibration(BQSR)#2
gatk BaseRecalibrator -R ref.fa -I test.sorted.markdup.bam --known-sites dbsnp_138.hg19.vcf -O recal_data.table #这里以dbsnp为例说明
gatk ApplyBQSR -R ref.fa -I test.sorted.markdup.bam -bqsr recal_data.table -O recal_reads.bam #2
#由test.sorted.markup.bam直接得到recal_reads.bam
个人理解,碱基质量重矫正需要提供vcf文件,那么没有变异数据库的物种进行变异检测,我们就想办法创造vcf文件,类似于dbsnp_138.hg19.vcf 这样的,作为 BaseRecalibrator的--known-sites参数。
②对于没有相应变异数据库的物种来说(不知是否可以略过?参考https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/)
#先进行第一次Call Variants得到raw vcf,用raw vcf代替dbsnp_138.hg19.vcf等进行矫正
gatk HaplotypeCaller -R ref.fa -I test.sorted.markdup.bam -O raw_variants.vcf
#Extract SNPs & Indels
gatk SelectVariants -R ref.fa -V raw_variants.vcf -select-type SNP -O raw_snps.vcf
gatk SelectVariants -R ref.fa -V raw_variants.vcf -select-type INDEL -O raw_indels.vcf
#Filter SNPs
gatk VariantFiltration -R ref.fa -V raw_snps.vcf \
-O filtered_snps.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 60.0" \
-filter-name "MQ_filter" -filter "MQ < 40.0" \
-filter-name "SOR_filter" -filter "SOR > 4.0" \
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
#Filter Indels
gatk VariantFiltration -R ref.fa -V raw_indels.vcf \
-O filtered_indels.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 200.0" \
-filter-name "SOR_filter" -filter "SOR > 10.0"
#Exclude Filtered Variants
gatk SelectVariants --exclude-filtered -V filtered_snps.vcf -O bqsr_snps.vcf
gatk SelectVariants --exclude-filtered -V filtered_indels.vcf -O bqsr_indels.vcf
#Base Quality Score Recalibration(BQSR)#1
gatk BaseRecalibrator -R ref.fa -I test.sorted.markdup.bam \
--known-sites bqsr_snps.vcf \
--known-sites bqsr_indels.vcf \
-O recal_data.table
#Apply BQSR #1
gatk ApplyBQSR -R ref.fa -I test.sorted.markdup.bam -bqsr recal_data.table -O recal_reads.bam
#先进行变异检测,变异过滤,得到bqsr_snps.vcf和bqsr_indels.vcf,再利用它们进行BaseRecalibrator和ApplyBQSR
#由test.sorted.markdup.bam,中间call一次变异,再进行矫正,最后得到recal_reads.bam
然后开始HaplotypeCaller
- 变异检测(HaplotypeCaller)
这一步由碱基重矫正后的recal_reads.bam生成原始vcf文件raw_variants_recal.vcf
gatk HaplotypeCaller -R ref.fa -I recal_reads.bam -O raw_variants_recal.vcf
- 变异质控和过滤
对原始vcf文件进行变异质控和过滤有2种方法,一是GATK VQSR,二是通过过滤指标进行过滤。VQSR的应用有一定要求,它需要存在相应物种的变异数据库(如1000G、dbsnp等),同时要求新检测的结果中有足够多的变异,适合全基因组分析。
①VQSR
#snp和indels分开校准
#校准snp
gatk VariantRecalibrator -V raw_variants_recal.vcf -O recalibrate_SNP.recal -mode SNP --tranches-file recalibrate_SNP.tranches \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -an QD \
-an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ --max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15.0:$hg19_VCF/hapmap_3.3.hg19.sites.vcf \
-resource omini,known=false,training=true,truth=false,prior=12.0:$hg19_VCF/1000G_omni2.5.hg19.sites.vcf \
-resource dbsnp,known=true,training=false,truth=false,prior=6.0:$hg19_VCF/dbsnp_138.hg19.vcf
gatk ApplyVQSR -V raw_variants_recal.vcf -O recalibrated_snps_raw.vcf --recal-file recalibrate_SNP.recal \
--tranches-file recalibrate_SNP.tranches \
-truth-sensitivity-filter-level 99.5 --create-output-variant-index true -mode SNP
#校准indel
gatk VariantRecalibrator -V raw_variants_recal.vcf -O recalibrate_INDEL.recal -mode INDEL --tranches-file recalibrate_INDEL.tranches \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -an QD \
-an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ --max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15.0:$hg19_VCF/hapmap_3.3.hg19.sites.vcf \
-resource omini,known=false,training=true,truth=false,prior=12.0:$hg19_VCF/1000G_omni2.5.hg19.sites.vcf \
-resource dbsnp,known=true,training=false,truth=false,prior=6.0:$hg19_VCF/dbsnp_138.hg19.vcf
gatk ApplyVQSR -V raw_variants_recal.vcf -O recalibrated_indel_raw.vcf --recal-file recalibrate_INDEL.recal \
--tranches-file recalibrate_INDEL.tranches \
-truth-sensitivity-filter-level 99.5 --create-output-variant-index true -mode INDEL
VariantRecalibrator和ApplyVQSR用法参照官方,以下贴出可能用到的参数:
USAGE: VariantRecalibrator [arguments]
Build a recalibration model to score variant quality for filtering purposes
Version:4.2.0.0
Required Arguments:
--output,-O <GATKPath> The output recal file used by ApplyVQSR Required.
--resource <FeatureInput> A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
This argument must be specified at least once. Required.
--tranches-file <File> The output tranches file used by ApplyVQSR Required.
--use-annotation,-an <String> The names of the annotations which should used for calculations This argument must be specified at least once. Required.
--variant,-V <GATKPath> One or more VCF files containing variants This argument must be specified at least once.Required.
Optional Arguments:
--truth-sensitivity-tranche,-tranche <Double>
The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1percent)
This argument may be specified 0 or more times. Default value: [100.0, 99.9,99.0, 90.0].
USAGE: ApplyVQSR [arguments]
Apply a score cutoff to filter variants based on a recalibration table
Version:4.2.0.0
Required Arguments:
--output,-O <GATKPath> The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value Required.
--recal-file <FeatureInput> The input recal file used by ApplyVQSR Required.
--variant,-V <GATKPath> One or more VCF files containing variants This argument must be specified at least once.Required.
②通过过滤指标过滤
gatk SelectVariants -R ref.fa -V raw_variants_recal.vcf -select-type SNP -O raw_snps_recal.vcf
gatk SelectVariants -R ref.fa -V raw_variants_recal.vcf -select-type INDEL -O raw_indels_recal.vcf
gatk VariantFiltration -R ref.fa -V raw_snps_recal.vcf \
-O filtered_snps_final.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 60.0" \
-filter-name "MQ_filter" -filter "MQ < 40.0" \
-filter-name "SOR_filter" -filter "SOR > 4.0" \
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
gatk VariantFiltration -R ref.fa -V raw_indels_recal.vcf \
-O filtered_indels_final.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 200.0" \
-filter-name "SOR_filter" -filter "SOR > 10.0"
至此,得到了filtered_snps_final.vcf和filtered_indels_final.vcf。
参考:
1.https://gatk.broadinstitute.org/hc/en-us
2.https://www.jianshu.com/p/c41e8f3266b4
3.https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/