1、MutSigCV
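MutSig stands for "Mutation Significance". In short, it pools the mutations from all tumor samples, models how often each gene would be mutated by chance, and reports the genes whose mutation burden exceeds the significance threshold as significantly mutated.
CV stands for covariates: DNA replication timing, chromatin openness, and transcriptional activity, which are used to estimate the background mutation rate of each gene.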
For how to run the software, see:
CGA: mutsig
生信菜鸟团
Jianshu: Finding driver genes with MutSigCV
Jianshu: Pan-cancer studies
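I used the script recommended on the official site. Since I do not have a MATLAB license, I ran it with the free MCR (MATLAB Compiler Runtime). Apart from my own MAF file, all other input files were downloaded.
For generating the MAF file, see the maftools tutorial. Briefly: annotate the VCF files from your analysis with annovar, convert the annotation to MAF, and, for use with MutSig, convert the gene names with maftools. A simple script: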
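library(maftools)
# Read the annotated MAF (the file name is a placeholder for your own file)
maf <- read.maf(maf = "my_annotated.maf")
# Convert gene symbols to the ones MutSig expects, then save the table for MutSigCV
mafcorrect <- prepareMutSig(maf = maf)
write.table(mafcorrect, "my_mutations.maf", sep = "\t", quote = FALSE, row.names = FALSE)
Then run MutSigCV on the corrected MAF from the shell:
run_MutSigCV.sh <path_to_MCR> my_mutations.maf exome_full192.coverage.txt gene.covariates.txt my_results mutation_type_dictionary_file.txt chr_files_hg19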
Output: <prefix>.sig_genes.txt
This file holds the significant-gene results: one gene per row with its q-value, sorted by q-value.
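To pull the significant genes out of the result table, a small filter is enough. The following is only a minimal sketch: it assumes the file is tab-separated with columns named `gene` and `q`, and the cutoff of 0.1 is a commonly used convention rather than something this post prescribes. The example rows are made up.

```python
import csv
import io


def significant_genes(handle, q_cutoff=0.1):
    """Return (gene, q) pairs with q <= q_cutoff, keeping the file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        q = float(row["q"])  # assumed column name
        if q <= q_cutoff:
            hits.append((row["gene"], q))  # assumed column name
    return hits


# Hypothetical excerpt standing in for my_results.sig_genes.txt
example = "gene\tp\tq\nTP53\t1e-16\t1e-12\nKRAS\t2e-08\t1e-04\nTTN\t0.4\t0.9\n"
hits = significant_genes(io.StringIO(example))
print(hits)
```

For a real run, replace the `io.StringIO` object with `open("my_results.sig_genes.txt")`.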
2、GISTIC
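GISTIC is software released by the Broad Institute for finding regions of somatic copy-number alteration (SCNA) that are likely to contain driver genes. Installation is a bit fiddly; see INSTALL.txt, or these Chinese-language guides:
Jianshu: Installing and using GISTIC2.0
生信菜鸟团: Finding SCNAs from multiple segment files with GISTIC
Input files: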
1、Segmentation File (-seg) (REQUIRED)
My seg file was exported from the results of the upstream CNVkit analysis. It has six columns, described below.
The column headers are:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num Markers (number of markers in segment)
(6) Seg.CN (log2(copy number) - 1, so a value of 0 corresponds to two copies)
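The Seg.CN definition in the list above can be checked with a small worked example. This is a sketch under the assumption that you start from absolute copy numbers; if your seg file already holds log2 ratios (which is likely the case for a CNVkit export), no conversion is needed. The function names are hypothetical helpers, not part of GISTIC or CNVkit.

```python
import math


def seg_cn(copy_number):
    """Convert an absolute copy number to Seg.CN, i.e. log2(copy number) - 1."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive to take log2")
    return math.log2(copy_number) - 1


def seg_line(sample, chrom, start, end, num_markers, copy_number):
    """Format one tab-separated, six-column segment row."""
    fields = [
        sample,
        str(chrom),
        str(start),
        str(end),
        str(num_markers),
        f"{seg_cn(copy_number):.4f}",
    ]
    return "\t".join(fields)


# Two copies give 0, four copies give 1, one copy gives -1
print(seg_cn(2), seg_cn(4), seg_cn(1))
print(seg_line("S1", 1, 100, 200, 10, 4))
```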
2、 Markers File (-mk)(optional)
The markers file identifies the marker positions used in the original dataset (before segmentation) for array or capture experiments.
3、Reference Genome File (-refgene)(REQUIRED)
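The GISTIC installation ships reference genome files under refgenefiles/. They were created in MATLAB and are in .mat format, so they cannot be viewed as text. Choose the one that matches the reference genome build you used.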
4、Array List File (-alf)(optional)
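The first line is the header "array", and each following line is one sample name. The file specifies the subset of samples to include in the analysis.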
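Since the array list file is just a header plus one sample name per line, it is easy to generate. A minimal sketch, with a hypothetical helper name and made-up sample names; duplicate names are dropped as a precaution, which is my own choice rather than a documented requirement:

```python
def array_list_text(sample_names):
    """Build the array list file content: an 'array' header, then one sample per line."""
    lines = ["array"]
    seen = set()
    for name in sample_names:
        if name not in seen:  # keep the first occurrence only
            seen.add(name)
            lines.append(name)
    return "\n".join(lines) + "\n"


text = array_list_text(["S1", "S2", "S1"])
print(text, end="")
```

Write `text` to a file and pass that file with -alf.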
5、CNV File (-cnv)(optional)
This file is used to exclude germline CNVs from the analysis.
Output files:
1、All Lesions File (all_lesions.conf_XX.txt, where XX is the confidence level)
This file summarizes all results of the GISTIC analysis: the significant regions, their q-values, and the per-sample data.
Region Data
Columns 1-9 present the data about the significant regions as follows:
(1) Unique Name: A name assigned to identify the region
(2) Descriptor: The genomic descriptor of that region.
(3) Wide Peak Limits: The "wide peak" boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.
(4) Peak Limits: The boundaries of the region of maximal amplification or deletion.
(5) Region Limits: The boundaries of the entire significant region of amplification or deletion.
(6) q-values: The q-value of the peak region.
(7) Residual q-values: The q-value of the peak region after removing ("peeling off") amplifications or deletions that overlap other, more significant peak regions in the same chromosome.
(8) Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called "broad"), focal events (called "focal"), or independently significant broad and focal events (called "both").
(9) Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.
Sample Data
Each of the analyzed samples is represented in one of the columns following the lesion data (columns 10 through end). The data contained in these columns varies slightly by section of the file.
A '0' indicates that the copy number of the sample was not amplified or deleted beyond the threshold amount in that peak region. A '1' indicates that the sample had low-level copy number aberrations (exceeding the low threshold indicated in column 9), and a '2' indicates that the sample had high-level copy number aberrations (exceeding the high threshold indicated in column 9).
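The 0/1/2 coding described above makes it straightforward to count how many samples carry each lesion. The sketch below assumes a tab-separated file whose first nine columns are the region data and whose remaining columns are samples, as described above. It skips any row whose sample values are not all 0, 1 or 2, since the sample columns can hold other data in other sections of the file. The example rows are made up.

```python
import csv
import io


def count_altered_samples(handle):
    """Map each region name to its number of low-level (1) and high-level (2) samples."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Sample columns start at column 10; ignore empty trailing header cells
    sample_cols = [i for i in range(9, len(header)) if header[i].strip()]
    counts = {}
    for row in reader:
        if not row:
            continue
        calls = [row[i].strip() for i in sample_cols if i < len(row)]
        if not calls or any(c not in ("0", "1", "2") for c in calls):
            continue  # not a row using the 0/1/2 coding
        counts[row[0].strip()] = {"low": calls.count("1"), "high": calls.count("2")}
    return counts


# Hypothetical two-region, three-sample excerpt
header = ["Unique Name", "Descriptor", "Wide Peak Limits", "Peak Limits",
          "Region Limits", "q values", "Residual q values", "Broad or Focal",
          "Amplitude Threshold", "S1", "S2", "S3"]
rows = [
    ["Amplification Peak 1", "8q24.21", "-", "-", "-", "0.001", "0.001", "focal", "-", "0", "1", "2"],
    ["Deletion Peak 1", "9p21.3", "-", "-", "-", "0.01", "0.01", "focal", "-", "2", "2", "0"],
]
example = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
counts = count_altered_samples(io.StringIO(example))
print(counts)
```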
2、Amplification/Deletion Genes File (amp(/del)_genes.conf_XX.txt, where XX is the confidence level)
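Each column is one amplification or deletion peak. The header rows give its cytoband, q-values, and boundaries, and the remaining rows list the genes in the peak (a peak that contains no gene shows its nearest gene in square brackets).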
The amp genes file contains one column for each amplification peak identified in the GISTIC analysis. The first four rows are:
(1) cytoband
(2) q-value
(3) residual q-value
(4) wide peak boundaries
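Because this file is laid out with one peak per column, it is often easier to work with after turning each column into a record. A rough sketch, assuming a tab-separated layout in which the first column holds row labels, the first four rows are the ones listed above, and every later row holds gene names. The example content, including the coordinates, is invented.

```python
import csv
import io


def cell(row, col):
    """Return the stripped value at a column, or '' if the row is too short."""
    return row[col].strip() if col < len(row) else ""


def peaks_from_genes_file(handle):
    """Turn the one-peak-per-column layout into a list of per-peak dicts."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header_rows, gene_rows = rows[:4], rows[4:]
    peaks = []
    for col in range(1, len(rows[0])):
        if not cell(header_rows[0], col):
            continue  # empty column, e.g. from a trailing tab
        peaks.append({
            "cytoband": cell(header_rows[0], col),
            "q_value": cell(header_rows[1], col),
            "residual_q_value": cell(header_rows[2], col),
            "wide_peak_boundaries": cell(header_rows[3], col),
            "genes": [cell(r, col) for r in gene_rows if cell(r, col)],
        })
    return peaks


# Hypothetical two-peak excerpt
example = (
    "cytoband\t8q24.21\t11q13.3\n"
    "q value\t1e-05\t0.002\n"
    "residual q value\t1e-05\t0.002\n"
    "wide peak boundaries\tchr8:1-2\tchr11:3-4\n"
    "genes in wide peak\tMYC\t[CCND1]\n"
    "\tPVT1\t\n"
)
peaks = peaks_from_genes_file(io.StringIO(example))
print(peaks)
```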
3、Gistic Scores File (scores.gistic)
The scores file lists the q-values [presented as -log10(q)], G-scores, average amplitudes among aberrant samples, and frequency of aberration, across the genome for both amplifications and deletions.
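Since the scores file stores q-values as -log10(q), they need to be converted back before comparing against a cutoff: q = 10^(-x). The sketch below shows that arithmetic on made-up rows. The tuple layout and the 0.25 cutoff are assumptions for illustration, not values read from a real file.

```python
import math


def q_from_neg_log10(neg_log10_q):
    """Convert a -log10(q) value back to q."""
    return 10.0 ** (-neg_log10_q)


def significant_rows(rows, q_cutoff=0.25):
    """Keep rows whose q-value (recovered from -log10(q)) is at or below the cutoff."""
    return [r for r in rows if q_from_neg_log10(r[4]) <= q_cutoff]


# Hypothetical rows: (type, chromosome, start, end, -log10(q))
rows = [
    ("Amp", "8", 1, 100, 2.0),   # q = 0.01
    ("Amp", "8", 101, 200, 0.3), # q is about 0.50
    ("Del", "9", 1, 100, 0.0),   # q = 1
]
kept = significant_rows(rows)
print(kept)
```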