一、snpeff
对vcf文件增加新字段注释ANN
二、安装snpeff
使用wget下载
$ wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
如果您更喜欢使用'curl'而不是'wget',则可以输入:
curl -L http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip> snpEff_latest_core.zip
三、对vcf进行注释
1.准备参考基因组
如果使用snpeff自带的数据库的参考基因组文件,用下列语句:
java -Xmx4g -jar snpEff.jar GRCh37.75 examples / test.chr22.vcf> test.chr22.ann.vcf
如果是使用自己下载的参考基因组,需要在config文件中添加
需要在下载snpeff的文件下,创建data/参考基因组名称 eg.data/R498
在文件夹下放置以下两个文件
在data文件夹下需创建genomes文件夹,在该文件夹下软连参考基因组
在存放snpeff.jar的文件夹下,运行
java -jar /TJPROJ5/GB_PAG/USER/sunhonglei/snpeff/snpEff/snpEff.jar build -gff3 -v R498 -c /TJPROJ5/GB_PAG/USER/sunhonglei/snpeff/snpEff/snpEff.config
注意:java需要1.7版本以上,输入文件可以是压缩的.gz文件,但输出是未压缩文件
2.运行脚本
/TJPROJ2/GB/PUBLIC/source/GB_HUMAN/guonei/disease_v1.0/Software/java/jdk1.8.0_51/bin/java -Xmx4G -jar /TJPROJ5/GB_PAG/USER/sunhonglei/snpeff/snpEff/snpEff.jar \
-c /TJPROJ5/GB_PAG/USER/sunhonglei/snpeff/snpEff/snpEff.config \
R498 \
SNP.vcf >1.vcf
3.产生结果文件
产生3个结果文件:snpEff_genes.txt,snpEff_summary.html,1.vcf
3.1 1.vcf
原vcf:
##reference=file:///TJPROJ5/GB_PAG/USER/sunhonglei/bsa-report/onlys/02.varDetect/02.SNP/ref/R498_Chr.fa
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Cell_death no_cell_death
Chr1 13141 . C G 750.20 PASS AC=4;AF=1.00;AN=4;DP=22;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=0.0000;MLEAC=4;MLEAF=1.00;MQ=40.05;MQ0=0;QD=34.10;SOR=1.931 GT:AD:DP:GQ:PL 1/1:0,10:10:27:320,27,0 1/1:0,12:12:36:457,36,0
snpeff注释后的vcf:
##reference=file:///TJPROJ5/GB_PAG/USER/sunhonglei/bsa-report/onlys/02.varDetect/02.SNP/ref/R498_Chr.fa------------------------参考基因组
##source=SelectVariants
##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff R498 SNP.vcf "--------------------------------------输入文件
###增加注释的ANN
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Cell_death no_cell_death
Chr1 13141 . C G 750.2 PASS AC=4;AF=1.00;AN=4;DP=22;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=0.0000;MLEAC=4;MLEAF=1.00;MQ=40.05;MQ0=0;QD=34.10;SOR=1.931;ANN=G|upstream_gene_variant|MODIFIER|OsR498G0100000400.01|OsR498G0100000400.01|transcript|OsR498G0100000400.01.T01|protein_coding||c.-426G>C|||||59|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,G|upstream_gene_variant|MODIFIER|OsR498G0100000700.01|OsR498G0100000700.01|transcript|OsR498G0100000700.01.T01|protein_coding||c.-3900C>G|||||3363|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,G|upstream_gene_variant|MODIFIER|OsR498G0100000700.01|OsR498G0100000700.01|transcript|OsR498G0100000700.01.T05|protein_coding||c.-3900C>G|||||3363|,G|upstream_gene_variant|MODIFIER|OsR498G0100000700.01|OsR498G0100000700.01|transcript|OsR498G0100000700.01.T04|protein_coding||c.-3900C>G|||||3363|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,G|upstream_gene_variant|MODIFIER|OsR498G0100000700.01|OsR498G0100000700.01|transcript|OsR498G0100000700.01.T03|protein_coding||c.-3900C>G|||||3363|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,G|upstream_gene_variant|MODIFIER|OsR498G0100000700.01|OsR498G0100000700.01|transcript|OsR498G0100000700.01.T02|protein_coding||c.-3900C>G|||||3363|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,G|intergenic_region|MODIFIER|OsR498G0100000400.01-OsR498G0100000700.01|OsR498G0100000400.01-OsR498G0100000700.01|intergenic_region|OsR498G0100000400.01-OsR498G0100000700.01|||n.13141C>G|||||| GT:AD:DP:GQ:PL 1/1:0,10:10:27:320,27,0 1/1:0,12:12:36:457,36,0
注释后的每个字段:
a. Allele (or ALT)等位基因: In case of multiple ALT fields, this helps to identify which ALT we are referring to.
b. Annotation (a.k.a. effect)注释: Annotated using Sequence Ontology terms. Multiple effects can be concatenated using ‘&’.
#CHROM POS ID REF ALT QUAL FILTER INFO chr1 123456。CA。。ANN = A | intron_variant&nc_transcript_variant | ...
c. Putative_impact假设影响: A simple estimation of putative impact / deleteriousness : {HIGH, MODERATE, LOW, MODIFIER}-------影响/有害性的一个简单估算
d. Gene Name基因名称: Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”.当variant位于基因间时,选择最近的基因名称
e. Gene ID: Gene ID
f. Feature type特征类型: Which type of feature is in the next field下一个字段中的特征类型 (e.g. transcript, motif, miRNA, etc.). It is preferred to use Sequence Ontology (SO) terms, but ‘custom’ (user defined) are allowed. ANN=A|stop_gained|HIGH|||transcript|... Tissue specific features may include cell type / tissue information separated by semicolon e.g.: ANN=A|histone_binding_site|LOW|||H3K4me3:HeLa-S3|...
g. Feature ID: Depending on the annotation根据注释, this may be: Transcript ID转录本ID (preferably using version number最好用版本号), Motif ID, miRNA, ChipSeq peak, Histone mark组蛋白标记, etc. Note: Some features may not have ID (e.g. histone marks from custom Chip-Seq experiments may not have a unique ID).
h.Transcript biotype转录本的生物型: The bare minimum is at least a description on whether the transcript is {“Coding”, “Noncoding”}. Whenever possible, use ENSEMBL biotypes.
i. Rank / total等级/总数:: Exon or Intron rank / total number of exons or introns.外显子或内含子等级/外显子或内含子的总数。
j. HGVS.c: Variant using HGVS notation (DNA level)使用HGVS注释的变异(DNA级别)。
k. HGVS.p: If variant is coding, this field describes the variant using HGVS notation (Protein level). Since transcript ID is already mentioned in ‘feature ID’, it may be omitted here.如果变异是编码,则此字段使用HGVS注释(蛋白质水平)描述变异。由于“ transcript ID”中已经提到了 feature ID,因此在此可以省略。
l. cDNA_position / cDNA_len: Position in cDNA and trancript’s cDNA length (one based).cDNA和转录子的cDNA长度(基于1)
m. CDS_position / CDS_len: Position and number of coding bases (one based includes START and STOP codons).编码碱基的位置和数量(一个碱基包括起始和终止密码子)。
n. Protein_position / Protein_len: Position and number of AA (one based, including START, but not STOP).氨基酸的位置和数量(一个基于AA,包括START,但不包括STOP)。
p. Distance to feature: All items in this field are options, so the field could be empty. 此字段中的所有项目均为选项,因此该字段可能为空。Up/Downstream: Distance to first / last codon 上游/下游:距第一个/最后一个密码子的距离
Intergenic: Distance to closest gene 基因间:距最近基因的距离
Distance to closest Intron boundary in exon (+/- up/downstream)在外显子中最靠近内含子边界的距离(+/-上游/下游)。. If same, use positive number.如果相同,则使用正数。
Distance to closest exon boundary in Intron (+/- up/downstream) 在内含子中最靠近的外显子边界的距离(+/-上/下游)
Distance to first base in MOTIF Distance to first base in miRNA Distance to exon-intron boundary in splice_site or splice _region
ChipSeq peak: Distance to summit (or peak center)
Histone mark / Histone state: Distance to summit (or peak center).距峰顶(或峰顶中心)的距离
q. Errors, Warnings or Information messages: Add errors, warnings or informative message that can affect annotation accuracy. It can be added using either ‘codes’ (as shown in column 1, e.g. W1) or ‘message types’ (as shown in column 2, e.g. WARNING_REF_DOES_NOT_MATCH_GENOME). All these errors, warnings or information messages messages are optional.
3.2 snpEff_genes.txt
GeneName--基因名称(通常是HUGO)
GeneId--基因的ID
Transcript Id--转录ID
BioType-- 转录的生物类型(如果有) 每种影响{高,中,低,修改}都会重复以下列
variants_impact_ *--计算每个影响类别的变异数量对于每个带注释的效果(例如,missense_variant,synonymous_variant,stop_lost等)重复以下列
variants_effect_ *--计算每种效果类型的变异数量以下列重复了几个基因组区域(DOWNSTREAM,EXON,INTRON,UPSTREAM等)
bases_affected_ *--变异与基因组区域重叠的碱基数
total_score_*--与该基因组区域重叠的分数总和。注意:分数仅在输入文件为“ BED”类型时可用(例如,在注释ChipSeq实验时)
length_*--基因组区域长度
3.3 snpEff_summary.html
summary:基因组名称,日期,snpeff版本,输入文件行数,变异的数量,基因组总长度,基因组有效长度,变异率:每218个碱基一个变异。
variants rate details :染色体,长度,变异数量,变异率。
Number variants by type:变异类型的数量。
Number of effects by impact:具有作用的变异数量?
Number of effects by functional class:按功能类别划分的效果数
四、统计基因组区域中bam的reads数
java -Xmx4g -jar snpEff.jar count GRCh37.68 readsFile_1.bam readsFile_2.bam ... readsFile_N.bam > countReads.txt
chr start end type IDs Reads:readsFile_1.bam Bases:readsFile_1.bam Reads:readsFile_2.bam Bases:readsFile_2.bam ...
1 1 11873 Intergenic DDX11L1 130 6631 50 2544
1 1 249250621 Chromosome 1 2527754 251120400 2969569 328173439
五、snpeff中脚本的功能
1. sam2fastq.pl 将SAM输入转换为FASTQ输出
samtools view test.bam | ./scripts/bam2fastq.pl | head -n 12
2.fasta2tab.pl将fasta文件转换为两列制表符分隔的TXT文件(名称\ t序列)示例(为简洁起见,输出被截断)
$ zcat ce6.fa.gz | ./scripts/fasta2tab.plchrI gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct ...chrV GAATTcctaagcctaagcctaagcctaagcctaagcctaagcctaa ...
3.fastaSplit.pl将多序列FASTA文件拆分为单个文件。示例:每条染色体创建一个文件
$ zcat ce6.fa.gz | ./scripts/fastaSplit.pl 写入chrI.fa 写入chrV.fa
4.queue.pl根据本地计算机中可用的CPU数量并行处理语句列表
5.splitChr.pl按染色体拆分文件。适用于第一列为CHR字段的任何制表符分隔文件(例如BED,VCF等)
六、snpeff参数
CommandMeaning
[eff|ann]:This is the default command. It is used for annotating variant files (e.g. VCF files).
build: Build a SnpEff database from reference genome files (FASTA, GTF, etc.).根据考基因组文件建立一个snpeff的数据库。
buildNextProt :Build NextProt database using XML files
cds :Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness (invoked automatically when building a database).比较从SnpEff数据库计算出的CDS序列与FASTA文件中的CDS序列。用于检查数据库的正确性。
closest :Annotate the closest genomic region.最近:注释最近的基因组区域。
count:Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.计算每个基因组间隔重叠多少间隔(来自BAM,BED或VCF文件)。
databases:Show currently available databases (from local config file).显示当前可用的数据库(来自本地配置文件)。
download:Download a SnpEff database.下载一个snpeff数据库。
dump: Dump to STDOUT a SnpEff database (mostly used for debugging).转储到STDOUT SnpEff数据库(主要用于调试)。
genes2bed: Create a bed file from a genes list.从基因列表创建一个bed文件。
len: Calculate total genomic length for each marker type.计算每种标记物类型的总基因组长度。
protein:Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness. (invoked automatically when building a database).将通过SnpEff数据库计算出的蛋白质序列与FASTA文件中的蛋白质序列进行比较。用于检查数据库的正确性。
spliceAnalysis:对接头位点进行分析。实验功能。
七、eff输出文件
EFF Sub-fieldMeaning
Effect:Effect of this variant. See details here.
Effect impact:Effect impact {High, Moderate, Low, Modifier}. See details here.
Functional Class:Functional class {NONE, SILENT, MISSENSE, NONSENSE}.
Codon_Change / DistanceCodon change: old_codon/new_codon OR distance to transcript (in case of upstream / downstream)
Amino_Acid_ChangeAmino acid change: old_AA AA_position/new_AA (e.g. 'E30K')
Amino_Acid_Length:Length of protein in amino acids (actually, transcription length divided by 3).
Gene_Name:Gene name
Transcript_Bio:TypeTranscript bioType, if available.
Gene_Coding:[CODING | NON_CODING]. This field is 'CODING' if any transcript of the gene is marked as protein coding.
Transcript_ID:Transcript ID (usually ENSEMBL IDs)
Exon/Intron Rank:Exon rank or Intron rank (e.g. '1' for the first exon, '2' for the second exon, etc.)
Genotype_Number:Genotype number corresponding to this effect (e.g. '2' if the effect corresponds to the second ALT)
其他
1.在snpeff数据库中搜索musculus的下载网址
$ java -jar snpEff.jar databases | grep -i musculus
2.利用-d参数进行调试
$ java -Xmx4G -jar snpEff.jar -d -v -canon GRCh37.75 test.vcf
3.筛选列表中的变异,my_transcripts.txt每行一个转录ID
$ java -Xmx4G -jar snpEff.jar -onlyTr my_transcripts.txt GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
4.可以计算lof(Loss of function)和NMD(nonsense-mediated decay)预测
#注意:从4.0版开始,不需要'-lof'命令行选项 java –Xmx4g -jar snpEff.jar –v \ -lof \ GRCh37.75 \ test.chr22.vcf> test.chr22.ann.vcf
5.注释bed文件
$ java -Xmx4g -jar snpEff.jar -i bed BDGP5.69 chipSeq_peaks.bed
6.筛选bed间隔区间的变异,interval.bed可以自己提供(-fi 参数)
$ java -Xmx4G -jar snpEff.jar -fi interval.bed GRCh38.76 test.chr22.vcf
7.经过测试,snpeff软件可以对samtools、GATK和sentieon的vcf进行注释。