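The previous post introduced SnpEff annotation databases. This post focuses on the SnpEff commands, and the final post will cover how to interpret the annotation results.
Files to prepare
- A SnpEff annotation database already built for the species: GRCh37.100 (~/snpeff/genome/GRCh37.100; see Part 1 for the detailed procedure)
- The SNP/INDEL file to annotate, in VCF format (any folder, here ~/database/SNP/human_GRCh37.vcf.gz)
🎃1 Quick annotation takes very little code: one step does it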
snpeffDir=~/snpeff
snpEff=${snpeffDir}/snpEff.jar
cd ~/database/SNP/
## Standard annotation (options placed before the genome name, as in SnpEff's usage line)
nohup java -Xmx10G -jar $snpEff -csvStats human_GRCh37_snpeff.snp.csv -stats human_GRCh37_snpeff.snp.html GRCh37.100 human_GRCh37.vcf.gz > human_GRCh37_snpeff.snp.vcf &
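Notes: the annotated file human_GRCh37_snpeff.snp.vcf holds the detailed per-variant information, and human_GRCh37_snpeff.snp.html contains the summary plots. In my case the plots in the HTML report failed to display in Microsoft Edge; if you run into the same problem, open the file in a different browser.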
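A quick way to confirm that the run actually annotated something is to count how many records carry an ANN entry in the INFO column. The following is a minimal sketch, not part of SnpEff: the function name count_ann and the tiny demo VCF are made up for illustration, and it assumes a plain-text (uncompressed) VCF.

```shell
# Count data records with and without an ANN entry in the INFO column (column 8).
count_ann() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        $8 ~ /(^|;)ANN=/ { n_ann++; next }     # record carries an annotation
        { n_plain++ }                          # record without ANN
        END { printf "annotated=%d unannotated=%d\n", n_ann, n_plain }
    ' "$1"
}

# Made-up two-record VCF, only to show the function running.
ann_demo=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$ann_demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$ann_demo"
printf '1\t100\t.\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|GENE1\n' >> "$ann_demo"
printf '1\t200\t.\tC\tT\t.\t.\tDP=10\n' >> "$ann_demo"

count_ann "$ann_demo"
```

On the real output this would be called as count_ann human_GRCh37_snpeff.snp.vcf. A large unannotated count usually points to a chromosome-name mismatch between the VCF and the database.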
🎃2 Annotating specific regions
Options for filtering the results (used together with the ann command):
-fi , -filterInterval <file> : Only analyze changes that intersect with the intervals specified in this file (you may use this option many times)
-no-downstream : Do not show DOWNSTREAM changes
-no-intergenic : Do not show INTERGENIC changes
-no-intron : Do not show INTRON changes
-no-upstream : Do not show UPSTREAM changes
-no-utr : Do not show 5_PRIME_UTR or 3_PRIME_UTR changes
-no EffectType : Do not show 'EffectType'. This option can be used several times.
# Example: keep only annotations inside genes (introns and UTRs are dropped too, along with upstream, downstream and intergenic hits)
java -Xmx10G -jar $snpEff ann -no-intron -no-utr -no-downstream -no-upstream -no-intergenic -csvStats human_GRCh37_snpeff.csv -stats human_GRCh37_snpeff.html GRCh37.100 human_GRCh37.vcf.gz > human_GRCh37_snpeff.snp.gene.vcf
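The -fi option from the list above needs an interval file. The sketch below assumes that file is in BED format (tab-separated chromosome, 0-based start, end) and that chromosome names match the database (Ensembl-style "1", "17" for GRCh37.100). The coordinates, the variable regions_bed and the helper names annotate_regions and check_bed are invented for illustration.

```shell
# Write a small BED file of regions of interest (made-up coordinates).
regions_bed=$(mktemp)
printf '1\t1000000\t2000000\n'    > "$regions_bed"
printf '17\t41000000\t42000000\n' >> "$regions_bed"

# Wrapper around the annotation call; SnpEff only runs when this is invoked.
# $1 = interval file, $2 = input VCF, $3 = output VCF
annotate_regions() {
    java -Xmx10G -jar "$snpEff" ann -fi "$1" GRCh37.100 "$2" > "$3"
}

# Sanity check: every line needs at least three columns and start < end.
check_bed() {
    awk -F'\t' 'NF < 3 || $2 >= $3 { bad++ } END { exit (bad > 0) }' "$1"
}

check_bed "$regions_bed" && echo "interval file looks valid"
```

With the interval file in place, a call such as annotate_regions "$regions_bed" human_GRCh37.vcf.gz human_GRCh37_snpeff.regions.vcf would restrict the analysis to those regions (the output name is just an example).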
General annotation options explained
Options:
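-chr <string> : Prepend 'string' to chromosome name (e.g. 'chr1' instead of '1'). This sets the chromosome prefix in the output.
-classic : Use old style annotations instead of Sequence Ontology and Hgvs. This switches to the old annotation format; the current default is Sequence Ontology. The old and new terms are compared in the table further down.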
-download : Download reference genome if not available. Default: true
-i <format> : Input format [ vcf, bed ]. Default: VCF.
-fileList : Input actually contains a list of files to process.
-o <format> : Output format [ vcf, gatk, bed, bedAnn ]. Default: VCF.
-s , -stats : Name of stats file (summary). Default is 'snpEff_summary.html'
-noStats : Do not create stats (summary) file
-csvStats <file> : Create CSV summary file (the commands in this post pass it a file name and still get the HTML report via -stats)
The most commonly used options are -chr, -classic and -csvStats.
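Whether -chr is needed depends on how chromosomes are named in your files. A rough sketch for listing the distinct chromosome names in a gzipped VCF follows; the function name vcf_chroms and the demo file are made up for illustration.

```shell
# Print each distinct CHROM value of a gzipped VCF, in order of first appearance.
vcf_chroms() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ && !($1 in seen) { seen[$1] = 1; print $1 }'
}

# Made-up gzipped VCF with records on chromosomes 1 and 2.
chrom_demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\t.\tA\tG\t.\t.\t.\n'
    printf '1\t200\t.\tC\tT\t.\t.\t.\n'
    printf '2\t300\t.\tG\tA\t.\t.\t.\n'
} | gzip -c > "$chrom_demo"

vcf_chroms "$chrom_demo"
```

If downstream tools expect 'chr1'-style names while the input uses '1', adding -chr chr to the annotation command is one way to get the prefixed names in the output.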
-classic
Sequence Ontology term | Classic term |
---|---|
coding_sequence_variant | CDS |
chromosome | CHROMOSOME_LARGE_DELETION |
coding_sequence_variant | CODON_CHANGE |
inframe_insertion | CODON_INSERTION |
disruptive_inframe_insertion | CODON_CHANGE_PLUS_CODON_INSERTION |
inframe_deletion | CODON_DELETION |
disruptive_inframe_deletion | CODON_CHANGE_PLUS_CODON_DELETION |
downstream_gene_variant | DOWNSTREAM |
exon_variant | EXON |
exon_loss_variant | EXON_DELETED |
frameshift_variant | FRAME_SHIFT |
gene_variant | GENE |
intergenic_region | INTERGENIC |
conserved_intergenic_variant | INTERGENIC_CONSERVED |
intragenic_variant | INTRAGENIC |
intron_variant | INTRON |
conserved_intron_variant | INTRON_CONSERVED |
miRNA | MICRO_RNA |
missense_variant | NON_SYNONYMOUS_CODING |
initiator_codon_variant | NON_SYNONYMOUS_START |
stop_retained_variant | NON_SYNONYMOUS_STOP |
rare_amino_acid_variant | RARE_AMINO_ACID |
splice_acceptor_variant | SPLICE_SITE_ACCEPTOR |
splice_donor_variant | SPLICE_SITE_DONOR |
splice_region_variant | SPLICE_SITE_REGION |
splice_region_variant | SPLICE_SITE_BRANCH |
splice_region_variant | SPLICE_SITE_BRANCH_U12 |
stop_lost | STOP_LOST |
5_prime_UTR_premature_start_codon_gain_variant | START_GAINED |
start_lost | START_LOST |
stop_gained | STOP_GAINED |
synonymous_variant | SYNONYMOUS_CODING |
start_retained | SYNONYMOUS_START |
stop_retained_variant | SYNONYMOUS_STOP |
transcript_variant | TRANSCRIPT |
regulatory_region_variant | REGULATION |
upstream_gene_variant | UPSTREAM |
3_prime_UTR_variant | UTR_3_PRIME |
3_prime_UTR_truncation + exon_loss | UTR_3_DELETED |
5_prime_UTR_variant | UTR_5_PRIME |
5_prime_UTR_truncation + exon_loss_variant | UTR_5_DELETED |
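To see which of these terms actually occur in an annotated file, the effect term can be tallied from the ANN entries. This is a sketch under the assumption that the output uses the default ANN format, where each comma-separated annotation is a '|'-delimited record whose second sub-field is the effect term. The function name effect_counts and the demo data are made up.

```shell
# Tally effect terms (2nd sub-field of every ANN annotation), most frequent first.
effect_counts() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($8, info, ";")                 # INFO entries
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                sub(/^ANN=/, "", info[i])
                m = split(info[i], anns, ",")        # one item per annotation
                for (j = 1; j <= m; j++) {
                    split(anns[j], f, "[|]")
                    count[f[2]]++
                }
            }
        }
        END { for (e in count) print count[e] "\t" e }
    ' "$1" | sort -k1,1nr -k2,2
}

# Made-up annotated records for demonstration.
effect_demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$effect_demo"
printf '1\t100\t.\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|GENE1,G|upstream_gene_variant|MODIFIER|GENE2\n' >> "$effect_demo"
printf '1\t200\t.\tC\tT\t.\t.\tANN=T|missense_variant|MODERATE|GENE3\n' >> "$effect_demo"

effect_counts "$effect_demo"
```

Combined effects joined with '&' (for example missense_variant&splice_region_variant) are counted as they appear, without being split further.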
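Glossary of some of the variant annotation terms:
- initiator_codon_variant: start codon variant
- downstream_gene_variant: variant downstream of a gene
- intergenic_region: intergenic variant
- intragenic_variant: variant within a gene
- intron_variant: intronic variant
- missense_variant: missense mutation
- non_coding_transcript_exon_variant: exonic variant in a non-coding transcript
- splice_acceptor_variant: splice acceptor site variant
- splice_donor_variant: splice donor site variant
- splice_region_variant: variant in a splice region
- stop_gained: stop codon gained
- stop_lost: stop codon lost
- stop_retained_variant: stop codon retained
- synonymous_variant: synonymous mutation
- upstream_gene_variant: variant upstream of a gene
- 5_prime_UTR_premature_start_codon_gain_variant: 5' UTR variant that creates a premature start codon
- 5_prime_UTR_variant: 5' UTR variant
- 3_prime_UTR_variant: 3' UTR variant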
🎃3 Options that control the annotations
Annotations options:
-cancer : Perform 'cancer' comparisons (Somatic vs Germline). Default: false
-cancerSamples <file> : Two column TXT file defining 'original \t derived' samples.
-formatEff : Use 'EFF' field compatible with older versions (instead of 'ANN').
-geneId : Use gene ID instead of gene name (VCF output). Default: false
-hgvs : Use HGVS annotations for amino acid sub-field. Default: true
-lof : Add loss of function (LOF) and Nonsense mediated decay (NMD) tags.
-noHgvs : Do not add HGVS annotations.
-noLof : Do not add LOF and NMD annotations.
-noShiftHgvs : Do not shift variants according to HGVS notation (most 3prime end).
-oicr : Add OICR tag in VCF file. Default: false
-sequenceOntology : Use Sequence Ontology terms. Default: true (the counterpart of -classic)
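When LOF/NMD tagging is active, the affected records can be pulled out of the annotated VCF with a short filter. The sketch below assumes the tags appear as LOF=... and NMD=... entries in the INFO column; the function name lof_nmd_sites and the demo records (including the gene IDs) are invented for illustration.

```shell
# List position and alleles of records that carry a LOF or NMD tag in INFO.
lof_nmd_sites() {
    awk -F'\t' '
        /^#/ { next }
        $8 ~ /(^|;)(LOF|NMD)=/ { print $1 "\t" $2 "\t" $4 ">" $5 }
    ' "$1"
}

# Made-up records: the first has a LOF tag, the second has none.
lof_demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$lof_demo"
printf '1\t100\t.\tC\tT\t.\t.\tANN=T|stop_gained|HIGH|GENE1;LOF=(GENE1|ENSG0001|1|1.00)\n' >> "$lof_demo"
printf '1\t200\t.\tA\tG\t.\t.\tANN=G|synonymous_variant|LOW|GENE2\n' >> "$lof_demo"

lof_nmd_sites "$lof_demo"
```

Running the same function on output produced with -noLof should print nothing, which is a simple way to confirm the option took effect.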
🎃4 Annotating canonical transcripts only
The output lists gene name, geneID, transcriptId and cdsLength.
java -Xmx10G -jar $snpEff -v -canon GRCh37.100 human_GRCh37.vcf.gz > human_GRCh37ann.canon.vcf
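One way to check that -canon had the intended effect is to look for genes that are still annotated against more than one transcript. This is only a sketch: it assumes the default ANN layout (4th sub-field gene name, 6th feature type, 7th feature ID), and the function name genes_with_multiple_transcripts and the demo records are made up.

```shell
# Report genes annotated against more than one transcript ID.
genes_with_multiple_transcripts() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                sub(/^ANN=/, "", info[i])
                m = split(info[i], anns, ",")
                for (j = 1; j <= m; j++) {
                    split(anns[j], f, "[|]")
                    if (f[6] != "transcript" || f[7] == "") continue
                    key = f[4] SUBSEP f[7]            # gene + transcript pair
                    if (!(key in seen)) { seen[key] = 1; ntx[f[4]]++ }
                }
            }
        }
        END { for (g in ntx) if (ntx[g] > 1) print g "\t" ntx[g] }
    ' "$1" | sort
}

# Made-up records: GENE1 hits two transcripts, GENE2 only one.
canon_demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$canon_demo"
printf '1\t100\t.\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|GENE1|ENSG01|transcript|ENST01|protein_coding,G|missense_variant|MODERATE|GENE1|ENSG01|transcript|ENST02|protein_coding\n' >> "$canon_demo"
printf '1\t200\t.\tC\tT\t.\t.\tANN=T|synonymous_variant|LOW|GENE2|ENSG02|transcript|ENST03|protein_coding\n' >> "$canon_demo"

genes_with_multiple_transcripts "$canon_demo"
```

On human_GRCh37ann.canon.vcf the function would be expected to print little or nothing, whereas the standard annotation from section 1 typically lists many genes.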