Integrative Software for Predicting Variant Deleteriousness

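Given candidate variants called from sequencing data, how do we decide whether a variant is deleterious? Accurately distinguishing neutral variants from pathogenic ones matters a great deal for the clinical testing of genetic disease. Studies have shown that in the exome data of a single sample, roughly 400 rare nonsynonymous variants still remain even after filtering on population frequency (below 1%) and functional consequence [1,2]. Accurate deleteriousness prediction that picks the pathogenic variants out of this large candidate pool would therefore go a long way toward supporting definitive clinical diagnosis and early intervention for genetic disease.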
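The frequency-and-function filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the cited studies: the `Variant` record, the consequence labels in `NONSYNONYMOUS`, and the function names are all hypothetical, and only the 1% cutoff comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical consequence labels; real annotators use their own vocabularies.
NONSYNONYMOUS = {"missense", "stop_gained", "stop_lost", "start_lost"}


@dataclass
class Variant:
    variant_id: str
    consequence: str
    # Highest allele frequency across reference populations;
    # None means the variant was not observed in any of them.
    max_pop_af: Optional[float]


def is_rare_nonsynonymous(v: Variant, af_cutoff: float = 0.01) -> bool:
    """Keep protein-altering variants whose population frequency is below the cutoff."""
    if v.consequence not in NONSYNONYMOUS:
        return False
    return v.max_pop_af is None or v.max_pop_af < af_cutoff


def filter_candidates(variants: List[Variant], af_cutoff: float = 0.01) -> List[Variant]:
    """Apply the frequency and function filters, preserving input order."""
    return [v for v in variants if is_rare_nonsynonymous(v, af_cutoff)]
```

Variants that pass this filter are the candidates that a deleteriousness predictor would then have to rank.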

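Many papers describing deleteriousness prediction tools have been published. dbNSFP is a continually updated resource for annotating human nonsynonymous single-nucleotide variants (nsSNVs); it currently covers 84,013,490 nsSNVs and splicing-site SNVs (ssSNVs). The latest release, dbNSFP v4.0, includes 29 deleteriousness predictors (SIFT, SIFT4G, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, VEST4, PROVEAN, FATHMM-MKL coding, FATHMM-XF coding, fitCons, LINSIGHT, DANN, GenoCanyon, Eigen, Eigen-PC, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, ALoFT) and 9 conservation scores (PhyloP x 3, phastCons x 3, GERP++, SiPhy, bStatistic). Its other annotations include population frequencies from the 1000 Genomes Project phase 3 data, the UK10K cohorts, the ExAC consortium, gnomAD, and ESP6500, along with a number of gene-level annotations. dbNSFP makes site-level annotation convenient, and it also shows that more than 40 site-level deleteriousness prediction tools are now available.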
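As an illustration of site-level annotation, the sketch below reads a small tab-separated table laid out in the style of a dbNSFP export. It assumes dbNSFP conventions as I understand them: several per-transcript values in one field separated by `;`, and `.` for a missing value. The column subset and every value in `SAMPLE` are made up for the example, and `parse_multi_score` and `load_scores` are hypothetical helper names.

```python
import csv
import io


def parse_multi_score(field, pick=max):
    """Collapse a multi-valued score field into one float, or None if all values are missing."""
    values = [float(x) for x in field.split(";") if x not in (".", "")]
    return pick(values) if values else None


# Made-up rows in a dbNSFP-like layout (assumed column names).
SAMPLE = (
    "#chr\tpos(1-based)\tref\talt\tSIFT_score\tREVEL_score\tCADD_phred\n"
    "1\t100\tA\tG\t0.02;0.10\t0.81;0.77\t25.3\n"
    "2\t200\tC\tT\t.\t.;0.12\t3.1\n"
)


def load_scores(text):
    """Return one dict per variant with a single value per score."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "key": (rec["#chr"], int(rec["pos(1-based)"]), rec["ref"], rec["alt"]),
            # Lower SIFT scores are more damaging, so keep the minimum.
            "SIFT": parse_multi_score(rec["SIFT_score"], pick=min),
            # For the other two scores, higher is more damaging.
            "REVEL": parse_multi_score(rec["REVEL_score"]),
            "CADD_phred": parse_multi_score(rec["CADD_phred"]),
        })
    return rows
```

Taking the most damaging value across transcripts is only one possible choice; restricting to a canonical transcript is another.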

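According to the classification by Kai Wang and Xiaoming Liu [3] (also the authors of the dbNSFP tool), deleteriousness predictors fall into the following groups by underlying principle and method: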

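Changes in protein function: the variant alters the spatial conformation of the protein, which in turn causes a harmful change in physiological function. Examples: PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT.
Evolutionary conservation: nucleotide or protein sequences from multiple species are put through multiple sequence alignment, and the variability of the homologous sequences is analyzed. Examples: GERP++, SiPhy, PhyloP.
Integrative tools: the outputs of several predictors are combined with additional features, and a model is trained on these multidimensional variant features with machine learning or related algorithms. Examples: CADD, DANN, MetaSVM, MetaLR, CONDEL, M-CAP, REVEL.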
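The integrative approach can be illustrated with a toy meta-predictor that learns how to weight several component scores. The sketch below fits a plain logistic regression by gradient descent using only the standard library. It shows the general idea and is not the implementation of MetaLR or any other published tool; the training rows are invented, and the assumption that every component score has been rescaled to [0, 1] with higher meaning more damaging is mine.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that a variant with component scores x is deleterious."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented training data: three rescaled component scores per variant,
# labelled 1 (deleterious) or 0 (neutral).
X = [
    [0.90, 0.80, 0.95], [0.85, 0.90, 0.70], [0.70, 0.95, 0.80],
    [0.10, 0.20, 0.05], [0.20, 0.10, 0.30], [0.15, 0.30, 0.10],
]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
```

Published tools differ mainly in which features go into `X`, where the labels in `y` come from, and which learner replaces the logistic regression, as the table below summarizes.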

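Because integrative tools combine the results of several predictors and add their own algorithms and features, they improve both the accuracy and the sensitivity of pathogenicity calls. Many tools of this kind have been published in recent years; they are summarized below: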

Name	Website	Publication date	Features	Training set	Algorithm
VEST	http://karchinlab.org/apps/appVest.html	28-May-13	The full set of 86 features for VEST classifier construction.	~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high-frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project.	supervised machine learning algorithm, Random Forest
CADD	http://cadd.gs.washington.edu/	2-Feb-14	63 annotations including 949 sequence features	13,141,299 SNVs, 627,071 insertions and 926,968 deletions from both the simulated variant and observed variant data sets.	support vector machine (SVM)
DANN	https://cbcl.ics.uci.edu/public_data/DANN/	22-Oct-14	Same as CADD	Same as CADD	deep neural network (DNN)
MetaSVM	doi: 10.1093/hmg/ddu733	22-Dec-14	nine scores (SIFT, PolyPhen-2, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy and PhyloP), along with allele frequency observed in diverse populations of the 1000 Genomes project.	Training dataset included 14,191 deleterious mutations, which were annotated as causing Mendelian disease, and 22,001 neutral mutations, which were annotated as not known to be associated with any phenotypes, all based on UniProt annotation.	support vector machine (SVM)
MetaLR	doi: 10.1093/hmg/ddu733	22-Dec-14	nine scores (SIFT, PolyPhen-2, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy and PhyloP), along with allele frequency observed in diverse populations of the 1000 Genomes project.	Training dataset included 14,191 deleterious mutations, which were annotated as causing Mendelian disease, and 22,001 neutral mutations, which were annotated as not known to be associated with any phenotypes, all based on UniProt annotation.	logistic regression (LR)
Eigen	http://www.columbia.edu/~ii2135/eigen.html	4-Jan-16	Protein function scores (SIFT, PolyPhen, and Mutation Assessor). Evolutionary conservation scores (GERP_NR and GERP_RS5); PhyloP primate (PhyloPri), placental mammal (PhyloPla) and vertebrate (PhyloVer). Allele frequencies in four populations (African (1-AF_AFR), European (1-AF_EUR), East Asian (1-AF_ASN) and admixed American (1-AF_AMR)) were obtained from the 1000 Genomes Project (November 2014).	the training data on ~76.7 million coding nonsynonymous variants	an unsupervised approach to integrate these different annotations into one measure of functional importance
IMHOTEP	http://www.uni-kiel.de/medinfo/cgi-bin/predictor/	26-Sep-16	integrated nine popular prediction tools (PolyPhen-2, SNPs&GO, MutPred, SIFT, MutationTaster2, Mutation Assessor and FATHMM as well as conservation-based Grantham Score and PhyloP) into a single predictor.	10,029 disease-causing single nucleotide variants (SNVs) from Human Gene Mutation Database and 10,002 putatively 'benign' nonsynonymous SNVs from UCSC	random forest, decision tree or logistic regression analysis.
REVEL	https://sites.google.com/site/revelgenomics/	6-Oct-16	a total of 18 individual pathogenicity prediction scores from 13 tools as predictive features: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons.	Human Gene Mutation Database (HGMD) version 2015.2 and the Exome Sequencing Project (ESP) European-American and African-American populations, the Atherosclerosis Risk in Communities (ARIC) study European-American and African-American populations, and the 1000 Genomes Project (KGP) European, Yoruban, and Asian populations. The final training set consisted of 6,182 HGMD disease variants and 123,706 rare neutral ESVs.	Random Forest
M-CAP	http://bejerano.stanford.edu/MCAP/	24-Oct-16	It uses nine established pathogenicity likelihood scores: SIFT, PolyPhen-2, CADD, MutationTaster, MutationAssessor, FATHMM, LRT, MetaLR, and MetaSVM. It also incorporates seven established measures of base-pair, amino acid, genomic region, and gene conservation: RVIS, PhyloP, PhastCons, PAM250, BLOSUM62, SIPHY, and GERP. In addition, M-CAP introduces 298 new features derived from multiple-sequence alignment of 99 primate, mammalian, and vertebrate genomes to the human genome.	HGMD Pro 2015.2 (pathogenic) and ExAC v3 (benign); 12,418 rare, missense pathogenic variants and 3,137,919 rare, missense benign variants	gradient boosting tree
DEOGEN2	https://deogen2.mutaframe.com/	26-Apr-17	PROVEAN score, Conservation Index, mutant/wild-type log-odds ratio, Early Folding predictions, PFAM log-odds score, Interaction patches annotation, RVIS, GDI, Recessiveness index, Gene essentiality, Pathway log-odds score	February 2016 version of Humsavar. 27,606 deleterious SNVs and 38,285 neutral SNVs retained.	the scikit-learn implementation of a Random Forest classifier with 200 trees.
MutPred	http://mutpred.mutdb.org/	9-May-17	extracted 1,345 (including 20 optional) features. These features are subcategorized into six groups: (1) sequence-based features, (2) substitution-based features, (3) position-specific scoring matrix-based features, (4) conservation-based features, (5) homolog profiles (optional due to time necessary to compute), and (6) changes in predicted structural and functional properties.	It is trained on a set of 53,180 pathogenic and 206,946 unlabeled (putatively neutral) variants obtained from the Human Gene Mutation Database (HGMD), SwissVar, dbSNP and inter-species pairwise alignment.	a bagged ensemble of 30 feed-forward neural networks
ALoFT	http://aloft.gersteinlab.org/	29-Aug-17	108 features used to train the model. The main features of ALoFT include (1) functional domain annotations; (2) evolutionary conservation; and (3) biological networks.	used three classes of premature stop variants as training data: benign variants, dominant disease-causing variants, and recessive disease-causing variants. The benign set includes homozygous premature stop variants discovered in a cohort of 1092 healthy people, Phase1 1000 Genomes data (1KG). Homozygous premature stop mutations from HGMD that lead to recessive disease and heterozygous premature stop variants in haplo-insufficient genes that lead to dominant disease represent the two disease classes.	random forest algorithm
MVP	https://github.com/ShenLab/missense	2-Feb-18	38 features used in constrained model, 21 features used in non-constrained model	22,390 missense mutations from Human Gene Mutation Database Pro version 2013 (HGMD) database under the disease mutation (DM) category, 12,875 deleterious variants from UniProt and 4,424 pathogenic variants from ClinVar database as true positives (TP). In total, there are 32,074 unique positive training variants. The negative training sets include 5,190 neutral variants from UniProt, 42,415 randomly selected rare variants from the DiscovEHR database, and 39,593 observed human-derived variants. In total, there are 86,620 unique negative training variants.	deep residual neural network model (ResNet)
ClinPred	https://sites.google.com/site/clinpred/	13-Sep-18	16 individual prediction scores from SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, LRT, MutationAssessor, PROVEAN, CADD, GERP, DANN, PhastCons, fitCons, PhyloP, and SiPhy. Allele frequencies (AFs) of each variant in different populations were obtained from the gnomAD database.	ClinVar database dated January 2016; 11,082 variants, with 7,059 labeled as benign and 4,023 labeled as pathogenic	random forest (cforest) and gradient boosted decision tree (xgboost)
PrimateAI	https://github.com/Illumina/PrimateAI	17-Dec-18	The total size of the network, with protein structure included, is 36 layers of convolutions, consisting of roughly 400,000 trainable parameters	Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD); ~380,000 common missense variants from humans and six non-human primate species, using a semi-supervised benign vs unlabeled training regimen	deep neural networks

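The summary above shows that integrative tools have moved from classical machine learning algorithms toward the now-popular deep learning, and that new tools built on different features and training sets are reported every year. It also shows that the accuracy of deleteriousness predictors fluctuates, and a number of studies have benchmarked the various tools [4,5,6]. This fluctuation may be driven by heterogeneity among variant sites. To reduce that heterogeneity and improve accuracy, one current direction is to build predictors around more specific disease, gene, or pathway information; the next section covers a recently published disease-specific predictor.

In any case, anyone using these tools should keep the ACMG standards and guidelines for classifying genetic variants in mind. According to the guidelines, the combined predictions of different software tools count as a single piece of evidence in interpretation, not as mutually independent pieces of evidence. Because each tool has its own strengths and weaknesses depending on its algorithm, using several programs for sequence variant interpretation is still recommended, and in many cases performance varies with the gene and protein sequence. These results are only predictions, so they should be applied to sequence variant interpretation with care, and the guidelines do not recommend using them as the sole source of evidence for a clinical judgment.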
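One way to respect the single-evidence rule in code is to collapse all tool calls into at most one computational evidence item. The sketch below does this with the ACMG/AMP computational evidence codes PP3 (supports a deleterious effect) and BP4 (suggests no impact). The `min_tools` and `agreement` thresholds are hypothetical choices for illustration and are not prescribed by the guidelines; the per-tool calls would come from whatever cutoffs a laboratory has adopted.

```python
def computational_evidence(calls, min_tools=3, agreement=0.8):
    """Collapse per-tool calls into at most one evidence code.

    calls maps a tool name to True (predicted deleterious),
    False (predicted benign), or None (no prediction available).
    Returns "PP3", "BP4", or None when the tools disagree or too few voted.
    """
    votes = [c for c in calls.values() if c is not None]
    if len(votes) < min_tools:
        return None  # too little computational evidence to say anything
    n_deleterious = sum(1 for c in votes if c)
    n_benign = len(votes) - n_deleterious
    if n_deleterious / len(votes) >= agreement:
        return "PP3"
    if n_benign / len(votes) >= agreement:
        return "BP4"
    return None  # conflicting predictions contribute no evidence
```

However many tools agree, the function never returns more than one code, so concordant predictors cannot be stacked as if they were independent lines of evidence.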

References

[1] Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., and McVean, G.A.; 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65.
[2] Tennessen, J.A., Bigham, A.W., O’Connor, T.D., Fu, W., Kenny, E.E., Gravel, S., McGee, S., Do, R., Liu, X., Jun, G., et al.; Broad GO; Seattle GO; NHLBI Exome Sequencing Project (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69.
[3] Dong, C., Wei, P., Jian, X., Gibbs, R., Boerwinkle, E., Wang, K., and Liu, X. (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics 24(8), 2125–2137.
[4] Korvigo, I., Afanasyev, A., Romashchenko, N., et al. (2017). Generalising better: applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies. bioRxiv 126532.
[5] Mahmood, K., Jung, C., Philip, G., et al. (2017). Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics. Human Genomics 11(1), 10.
[6] Zhou, Y., Fujikura, K., Mkrtchian, S., et al. (2018). Computational methods for the pharmacogenetic interpretation of next generation sequencing data. Frontiers in Pharmacology 9, 1437.