使用GATK 4.1.9.0 分析单个个体(具有一个或多个肿瘤样本)体细胞变异Somatic short variant discovery (SNVs + Indels),有或无正常对照样本。
Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.
1. 按照官网推荐GATK4最佳实践流程分析肿瘤样本体细胞变异(How to) Call somatic mutations using GATK4 Mutect2。

how to call somatic mutations using GATK4 Mutect2.png
2. GATK 4.1.9.0 分析体细胞变异主要分为两步:第一步,使用GATK4 Mutect2软件分析候选体细胞变异;第二步,对候选体细胞变异进行过滤。

GATK Mutect2 calling SNVs+Indels.png
官网流程代码
## step 1: Call candidate variants
# 配对样本
gatk Mutect2 \
-R ref.fasta \
-L intervals.interval_list \
-I tumor.bam -tumor tumor\
-I normal.bam -norma normal\
-germline-resource af-only-gnomad.vcf \
-pon panel_of_normals.vcf \
--f1r2-tar-gz f1r2.tar.gz \
-O unfiltered.vcf
# 仅有肿瘤样本
gatk Mutect2 \
-R ref.fasta \
-L intervals.interval_list \
-I tumor.bam -tumor tumor\
-germline-resource af-only-gnomad.vcf \
-pon panel_of_normals.vcf \
--f1r2-tar-gz f1r2.tar.gz \
-O unfiltered.vcf
## step 2: Calculate Contamination
gatk GetPileupSummaries \
-I tumor.bam \
-V chr17_small_exac_common_3_grch38.vcf.gz \
-L chr17_small_exac_common_3_grch38.vcf.gz \
-O tumor.getpileupsummaries.table
gatk GetPileupSummaries \
-I normal.bam \
-V chr17_small_exac_common_3_grch38.vcf.gz \
-L chr17_small_exac_common_3_grch38.vcf.gz \
-O normal.getpileupsummaries.table
gatk CalculateContamination \
-I tumor.getpileupsummaries.table \
-matched normal.getpileupsummaries.table
-O calculatecontamination.table
## step 3: Learn Orientation Bias Artifacts
gatk LearnReadOrientationModel \
-I f1r2.tar.gz \
-O read-orientation-model.tar.gz
## step 4: Filter Variants
gatk FilterMutectCalls \
-V unfiltered.vcf \
--contamination-table calculatecontamination.table \
--ob-priors read-orientation-model.tar.gz \
-O filtered.vcf
3. 在使用GATK4 Mutect2软件分析候选体细胞变异时,最好使用Panel of Normals (PON)文件,以提高变异分析准确度。
GATK开发团队认为,在生成PON文件时,正常对照样本越多越好(至少40个),但是其实一些小样本(4-10个)也是可以的,有总比没有好。

Panel of Normals.png
4. 官网三步生成Panel of Normals(PON)教程
# Step 1: Run Mutect2 in tumor-only mode for each normal sample:
gatk Mutect2 \
-R reference.fasta \
-I normal1.bam \
--max-mnp-distance 0 \
-O normal1.vcf.gz
# Step 2: Create a GenomicsDB from the normal Mutect2 calls:
gatk GenomicsDBImport \
-R reference.fasta \
-L intervals.interval_list \
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz
# Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:
gatk CreateSomaticPanelOfNormals \
-R reference.fasta \
--germline-resource af-only-gnomad.vcf.gz \
-V gendb://pon_db \
-O pon.vcf.gz
5. 本次使用Agilent_v5 + UTR WES数据(A是肿瘤样本,B是正常对照)构建PON

agilent v5.png
6. 经过数据前处理得到bqsr.bam文件

BQSR1.bam.png

B.bqsr.bam.png
7. step1_pon代码
#!/usr/bash
# Working directory
rawdata_dir=$HOME/wes/rawdata
cleandata_dir=$HOME/wes/cleandata
align_dir=$HOME/wes/align
variation_dir=$HOME/wes/variation
annotation_dir=$HOME/wes/annotation
# Software and Reference
INDEX=/opt/reference/gatk_hg19_index/gatk_hg19
GATK=/opt/software/gatk-4.1.9.0/gatk
REF=/opt/reference/gatk_bundle_hg19/ucsc.hg19.fasta
BED=/opt/reference/Exome_Agilent_V5/Exome-Agilent_V5chr_Clinical.bed
INTERVAL_LIST=/opt/reference/Exome_Agilent_V5/Agilent_V5chr_Clinical.interval_list
gnomad_b37_vcf=/opt/reference/gatk_bundle_hg19/Mutect2/af-only-gnomad.raw.sites.b37.vcf
#### Panel of Normals
# step1_pon
# -L ${BED} \
# --native-pair-hmm-threads 16 \
# --independent-mates true \
ls $align_dir/*B.bqsr.bam | while read id
do
sample=$(basename ${id} .bqsr.bam)
echo ${id} ${sample}
echo "start step1_pon for ${sample}" `date`
$GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" Mutect2 \
-R ${REF} \
-I ${id} \
--max-mnp-distance 0 \
-O $variation_dir/pon/${sample}.vcf.gz \
1>$variation_dir/pon/step1_pon.log 2>&1
echo "end step1_pon for ${sample}" `date`
done

B.vcf.gz.jpg
8. step2_pon 代码
tmp=""
for i in $(ls $variation_dir/pon/*.vcf.gz)
do
argv="$tmp -V $i"
tmp=$argv
done
echo ${argv}
echo "start step2_pon" `date`
$GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" GenomicsDBImport \
-R ${REF} \
-L ${BED} \
--reader-threads 12 \
--genomicsdb-workspace-path $variation_dir/pon/pon_db \
${argv} \
1>$variation_dir/pon/step2_pon.log 2>&1
echo "end step2_pon" `date`
8.1 step2_pon 结果:会产生大量 的 *.json文件

step2_pon_loader.file.png
8.2 step2_pon 结果:会产生大量 的 chr$文件

step2_pon_chr.file.png
8.3 CUP不稳定

cup2.png

cpu3.png

cpu1.png
我是10T的空间,跑完step2_pon大概消耗7T空间,再做step3 的时候报错,空间不足,无法执行程序。
9. 分析原因:PON第二步 "-L intervals.interval_list"选项需要后缀.interval_list文件,使用bed文件代替,程序运行错误。因此,通过bed文件制作.interval_list文件
awk '{print $1}' Exome-Agilent_V5chr_Clinical.bed|uniq |awk '{print "-L " $1}'|tr "\n" " " > Agilent_V5chr_Clinical.interval_list
10. 新的流程代码,能够成功运行!
# step2_pon
tmp=""
for i in $(ls $variation_dir/pon/*.vcf.gz)
do
argv="$tmp -V $i"
tmp=$argv
done
echo ${argv}
echo "start step2_pon" `date`
$GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" GenomicsDBImport \
-R ${REF} \
$(cat ${INTERVAL_LIST}) \
--reader-threads 12 \
--genomicsdb-workspace-path $variation_dir/pon/pon_db \
${argv} \
1>$variation_dir/pon/step2_pon.log 2>&1
echo "end step2_pon" `date`
11. step3_pon 代码,最终生成所需的pon.vcf.gz文件!
# step3_pon
echo "start step3_pon" `date`
$GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" CreateSomaticPanelOfNormals \
-R ${REF} \
--germline-resource ${gnomad_hg19_vcf_gz} \
-V gendb://$variation_dir/pon/pon_db \
--tmp-dir $variation_dir/pon \
-O $variation_dir/pon/pon.vcf.gz \
1>$variation_dir/pon/step3_pon.log 2>&1
echo "end step3_pon" `date`