GATK 4.1.9.0 CreateSomaticPanelOfNormals 报错及解决办法

使用GATK 4.1.9.0 分析单个个体（具有一个或多个肿瘤样本）体细胞变异Somatic short variant discovery (SNVs + Indels)，有或无正常对照样本。

Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.

1. 按照官网推荐GATK4最佳实践流程分析肿瘤样本体细胞变异(How to) Call somatic mutations using GATK4 Mutect2。

how to call somatic mutations using GATK4 Mutect2.png

2. GATK 4.1.9.0 分析体细胞变异主要分为两步：第一步，使用GATK4 Mutect2软件分析候选体细胞变异；第二步，对候选体细胞变异进行过滤。

GATK Mutect2 calling SNVs+Indels.png

官网流程代码

## step 1: Call candidate variants
# 配对样本
gatk Mutect2 \
    -R ref.fasta \
    -L intervals.interval_list \
    -I tumor.bam -tumor tumor\
    -I normal.bam -norma normal\
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

# 仅有肿瘤样本
gatk Mutect2 \
    -R ref.fasta \
    -L intervals.interval_list \
    -I tumor.bam -tumor tumor\
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

## step 2: Calculate Contamination
gatk GetPileupSummaries \
    -I tumor.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O tumor.getpileupsummaries.table

gatk GetPileupSummaries \
    -I normal.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O normal.getpileupsummaries.table
    
gatk CalculateContamination \
    -I tumor.getpileupsummaries.table \
    -matched normal.getpileupsummaries.table
    -O calculatecontamination.table

## step 3: Learn Orientation Bias Artifacts
gatk LearnReadOrientationModel \
    -I f1r2.tar.gz \
    -O read-orientation-model.tar.gz

## step 4: Filter Variants
gatk FilterMutectCalls \
    -V unfiltered.vcf \
    --contamination-table calculatecontamination.table \
    --ob-priors read-orientation-model.tar.gz \
    -O filtered.vcf

3. 在使用GATK4 Mutect2软件分析候选体细胞变异时，最好使用Panel of Normals (PON)文件，以提高变异分析准确度。

GATK开发团队认为，在生成PON文件时，正常对照样本越多越好（至少40个），但是其实一些小样本（4-10个）也是可以的，有总比没有好。

Panel of Normals.png

4. 官网三步生成Panel of Normals（PON）教程

# Step 1: Run Mutect2 in tumor-only mode for each normal sample:
gatk Mutect2 \
    -R reference.fasta \
    -I normal1.bam \
    --max-mnp-distance 0 \
    -O normal1.vcf.gz

# Step 2: Create a GenomicsDB from the normal Mutect2 calls:
gatk GenomicsDBImport \
    -R reference.fasta \
    -L intervals.interval_list \
    --genomicsdb-workspace-path pon_db \
    -V normal1.vcf.gz \
    -V normal2.vcf.gz \
    -V normal3.vcf.gz
    
# Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:
gatk CreateSomaticPanelOfNormals \
    -R reference.fasta \
    --germline-resource af-only-gnomad.vcf.gz \
    -V gendb://pon_db \
    -O pon.vcf.gz

5. 本次使用Agilent_v5 + UTR WES数据（A是肿瘤样本，B是正常对照）构建PON

agilent v5.png

6. 经过数据前处理得到bqsr.bam文件

BQSR1.bam.png

B.bqsr.bam.png

7. step1_pon代码

#!/usr/bash

# Working directory
rawdata_dir=$HOME/wes/rawdata
cleandata_dir=$HOME/wes/cleandata
align_dir=$HOME/wes/align
variation_dir=$HOME/wes/variation
annotation_dir=$HOME/wes/annotation

# Software and Reference
INDEX=/opt/reference/gatk_hg19_index/gatk_hg19
GATK=/opt/software/gatk-4.1.9.0/gatk
REF=/opt/reference/gatk_bundle_hg19/ucsc.hg19.fasta
BED=/opt/reference/Exome_Agilent_V5/Exome-Agilent_V5chr_Clinical.bed
INTERVAL_LIST=/opt/reference/Exome_Agilent_V5/Agilent_V5chr_Clinical.interval_list
gnomad_b37_vcf=/opt/reference/gatk_bundle_hg19/Mutect2/af-only-gnomad.raw.sites.b37.vcf

#### Panel of Normals
# step1_pon
# -L ${BED} \
# --native-pair-hmm-threads 16 \
# --independent-mates true \

ls $align_dir/*B.bqsr.bam | while read id
do
    sample=$(basename ${id} .bqsr.bam)
    echo ${id} ${sample}
    
    echo "start step1_pon for ${sample}" `date`
    $GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" Mutect2 \
    -R ${REF} \
    -I ${id} \
    --max-mnp-distance 0 \
    -O $variation_dir/pon/${sample}.vcf.gz \
    1>$variation_dir/pon/step1_pon.log 2>&1
    echo "end step1_pon for ${sample}" `date`
done

B.vcf.gz.jpg

8. step2_pon 代码

tmp=""
for i in $(ls $variation_dir/pon/*.vcf.gz)
do
    argv="$tmp -V $i"
    tmp=$argv
done
    echo ${argv}
    
    echo "start step2_pon" `date`
    $GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" GenomicsDBImport \
    -R ${REF} \
    -L ${BED} \
    --reader-threads 12 \
    --genomicsdb-workspace-path $variation_dir/pon/pon_db \
    ${argv} \
    1>$variation_dir/pon/step2_pon.log 2>&1
    echo "end step2_pon"  `date`

8.1 step2_pon 结果：会产生大量的 *.json文件

step2_pon_loader.file.png

8.2 step2_pon 结果：会产生大量的 chr$文件

step2_pon_chr.file.png

8.3 CUP不稳定

cup2.png

cpu3.png

cpu1.png

我是10T的空间，跑完step2_pon大概消耗7T空间，再做step3 的时候报错，空间不足，无法执行程序。

9. 分析原因：PON第二步 "-L intervals.interval_list"选项需要后缀.interval_list文件，使用bed文件代替，程序运行错误。因此，通过bed文件制作.interval_list文件

awk '{print $1}' Exome-Agilent_V5chr_Clinical.bed|uniq |awk '{print "-L " $1}'|tr "\n" " " > Agilent_V5chr_Clinical.interval_list

10. 新的流程代码，能够成功运行！

# step2_pon 
    tmp=""
    for i in $(ls $variation_dir/pon/*.vcf.gz)
    do
        argv="$tmp -V $i"
        tmp=$argv
    done
        echo ${argv} 
        
    echo "start step2_pon" `date`
    $GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" GenomicsDBImport \
    -R ${REF} \
    $(cat ${INTERVAL_LIST}) \
    --reader-threads 12 \
    --genomicsdb-workspace-path $variation_dir/pon/pon_db \
    ${argv} \
    1>$variation_dir/pon/step2_pon.log 2>&1
    echo "end step2_pon"  `date`

11. step3_pon 代码，最终生成所需的pon.vcf.gz文件！

# step3_pon
    echo "start step3_pon" `date`
    $GATK --java-options "-Xmx100G -Djava.io.tmpdir=./" CreateSomaticPanelOfNormals \
    -R ${REF} \
    --germline-resource ${gnomad_hg19_vcf_gz} \
    -V gendb://$variation_dir/pon/pon_db \
    --tmp-dir $variation_dir/pon \
    -O $variation_dir/pon/pon.vcf.gz \
    1>$variation_dir/pon/step3_pon.log 2>&1
    echo "end step3_pon" `date`

GATK 4.1.9.0 CreateSomaticPanelOfNormals 报错及解决办法

GATK 4.1.9.0 CreateSomaticPanelOfNormals 报错及解决办法

使用GATK 4.1.9.0 分析单个个体（具有一个或多个肿瘤样本）体细胞变异Somatic short variant discovery (SNVs + Indels)，有或无正常对照样本。

1. 按照官网推荐GATK4最佳实践流程分析肿瘤样本体细胞变异(How to) Call somatic mutations using GATK4 Mutect2。

2. GATK 4.1.9.0 分析体细胞变异主要分为两步：第一步，使用GATK4 Mutect2软件分析候选体细胞变异；第二步，对候选体细胞变异进行过滤。

官网流程代码

3. 在使用GATK4 Mutect2软件分析候选体细胞变异时，最好使用Panel of Normals (PON)文件，以提高变异分析准确度。

4. 官网三步生成Panel of Normals（PON）教程

5. 本次使用Agilent_v5 + UTR WES数据（A是肿瘤样本，B是正常对照）构建PON

6. 经过数据前处理得到bqsr.bam文件

7. step1_pon代码

8. step2_pon 代码

8.1 step2_pon 结果：会产生大量的 *.json文件

8.2 step2_pon 结果：会产生大量的 chr$文件

8.3 CUP不稳定

我是10T的空间，跑完step2_pon大概消耗7T空间，再做step3 的时候报错，空间不足，无法执行程序。

9. 分析原因：PON第二步 "-L intervals.interval_list"选项需要后缀.interval_list文件，使用bed文件代替，程序运行错误。因此，通过bed文件制作.interval_list文件

10. 新的流程代码，能够成功运行！

11. step3_pon 代码，最终生成所需的pon.vcf.gz文件！

相关阅读更多精彩内容

友情链接更多精彩内容

GATK 4.1.9.0 CreateSomaticPanelOfNormals 报错及解决办法

使用GATK 4.1.9.0 分析单个个体（具有一个或多个肿瘤样本）体细胞变异Somatic short variant discovery (SNVs + Indels)，有或无正常对照样本。

1. 按照官网推荐GATK4最佳实践流程分析肿瘤样本体细胞变异(How to) Call somatic mutations using GATK4 Mutect2。

2. GATK 4.1.9.0 分析体细胞变异主要分为两步：第一步，使用GATK4 Mutect2软件分析候选体细胞变异；第二步，对候选体细胞变异进行过滤。

官网流程代码

3. 在使用GATK4 Mutect2软件分析候选体细胞变异时，最好使用Panel of Normals (PON)文件，以提高变异分析准确度。

4. 官网三步生成Panel of Normals（PON）教程

5. 本次使用Agilent_v5 + UTR WES数据（A是肿瘤样本，B是正常对照）构建PON

6. 经过数据前处理得到bqsr.bam文件

7. step1_pon代码

8. step2_pon 代码

8.1 step2_pon 结果：会产生大量 的 *.json文件

8.2 step2_pon 结果：会产生大量 的 chr$文件

8.3 CUP不稳定

我是10T的空间，跑完step2_pon大概消耗7T空间，再做step3 的时候报错，空间不足，无法执行程序。

9. 分析原因：PON第二步 "-L intervals.interval_list"选项需要后缀.interval_list文件，使用bed文件代替，程序运行错误。因此，通过bed文件制作.interval_list文件

10. 新的流程代码，能够成功运行！

11. step3_pon 代码，最终生成所需的pon.vcf.gz文件！

相关阅读更多精彩内容

友情链接更多精彩内容

8.1 step2_pon 结果：会产生大量的 *.json文件

8.2 step2_pon 结果：会产生大量的 chr$文件