SNV detection tool: Fuwa, using a decision tree

Fuwa is a tool for calling SNVs and indels that uses a decision tree (CART) algorithm.
These notes look only at the features the paper uses for classification.

The paper selects 12 features for training and prediction. They fall into four categories: read depth, base quality, mapping/alignment quality, and strand bias.

Category 1: Read Depth

Features under this category measure the absolute depth and depth ratio of reads that are “effective” to be a specific candidate variant. “Effective” means that the read shares the same base as the candidate variant at the candidate’s locus.
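In other words, this category measures sequencing depth at candidate variant positions, counting only “effective” reads: reads that carry the candidate variant's base at that position and therefore differ from the reference genome. Because the goal is to detect variants, the depth of reads supporting the non-reference allele is the more informative statistic.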

This category contains three features.

effective base depth

Effective Base Depth (EBD) is the sum of the depths of effective reads. For indel reads, the EBD equals the mapping quality, while for SNV reads, the EBD is the value of the mapping quality multiplied by the base quality.

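From the name I expected this to be the depth of reads supporting the variant, but it is actually derived from mapping quality and base quality.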
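To make the definition concrete, here is a minimal sketch of the EBD calculation as described in the quote above. The function name and the (mapping quality, base quality) tuple layout are my own assumptions for illustration, not something taken from Fuwa's code.

```python
from typing import Iterable, Tuple


def effective_base_depth(reads: Iterable[Tuple[int, int]], is_indel: bool) -> int:
    """Sum the per-read contributions of the effective reads.

    Each read is a hypothetical (mapping_quality, base_quality) pair.
    Indel reads contribute MQ; SNV reads contribute MQ * BQ.
    """
    total = 0
    for mapq, baseq in reads:
        total += mapq if is_indel else mapq * baseq
    return total


reads = [(60, 30), (60, 20), (20, 10)]
print(effective_base_depth(reads, is_indel=False))  # 60*30 + 60*20 + 20*10
print(effective_base_depth(reads, is_indel=True))   # 60 + 60 + 20
```

So a well-mapped, high-quality read adds far more to the EBD than a poorly mapped or low-quality one, which is why this is a weighted depth rather than a plain read count.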

effective base depth ratio

The EBD ratio, i.e., the EBD of one candidate variant divided by the sum of the EBDs of all candidate variants at that locus. If this indicator is very low, the related candidate variant tends to be a random error.

This feature is mainly relevant at multi-allelic sites: it is the EBD of one candidate allele divided by the total EBD of all candidate alleles at that position.
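A small sketch of the ratio, assuming the per-allele EBD values have already been computed. The dictionary keys and the function name are made up for illustration.

```python
def ebd_ratios(ebd_by_allele: dict) -> dict:
    """Divide each candidate's EBD by the summed EBD of all candidates at the locus."""
    total = sum(ebd_by_allele.values())
    if total == 0:
        # No effective evidence at all; assumption: report 0.0 for every candidate.
        return {allele: 0.0 for allele in ebd_by_allele}
    return {allele: ebd / total for allele, ebd in ebd_by_allele.items()}


ratios = ebd_ratios({"A>G": 3200, "A>T": 60})
for allele, ratio in ratios.items():
    print(allele, round(ratio, 4))
```

Here the "A>T" candidate gets a ratio below 2%, which is the kind of value the paper associates with a random error.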

DeltaL

DeltaL is a statistic describing the difference between optimal and suboptimal genotypes. Fuwa first hypothesizes that the variant is true, so the reads covering this locus obey an almost ideal variant model: 0/1 or 1/1. The logarithms of likelihood under these two ideal models are calculated separately, and the bigger one is selected as L1. Then, Fuwa calculates the second likelihood logarithm, L2, under another hypothesis that the variant is false and that reads covering this locus follow the binomial distribution model. Thus, L1-L2, or DeltaL, is the logarithm of the ratio of the first and second likelihoods. If DeltaL is close to 0, which means the likelihoods of the ideal model and the binomial model are nearly equal, we empirically judged the variant to be false positive; otherwise, the variant tends to be true.

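I did not fully follow how this is computed. Roughly, it is a statistical test for deciding which candidate genotypes are likely to be errors, i.e. false positives.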
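The quote does not give the exact likelihood models, so the following is only a rough sketch of the general idea (a log-likelihood difference between the better of two genotype models and a "no variant" model). Everything specific here is my assumption, not the paper's method: binomial likelihoods for all three models, an alternate-allele fraction of 0.5 for 0/1, 1 - error_rate for 1/1, error_rate for the false-variant model, and error_rate = 0.01.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)


def delta_l(alt_reads: int, total_reads: int, error_rate: float = 0.01) -> float:
    """Hypothetical DeltaL: L1 (best variant model) minus L2 (error model)."""
    l_het = log_binom_pmf(alt_reads, total_reads, 0.5)             # assumed 0/1 model
    l_hom = log_binom_pmf(alt_reads, total_reads, 1 - error_rate)  # assumed 1/1 model
    l1 = max(l_het, l_hom)
    l2 = log_binom_pmf(alt_reads, total_reads, error_rate)         # assumed error model
    return l1 - l2


print(delta_l(15, 30))  # half the reads support the variant
print(delta_l(1, 30))   # a single supporting read
```

Under these assumptions a site with half of its reads supporting the variant gets a large positive value, while a site with one supporting read does not, which matches the direction of the paper's description even if the numbers would differ from Fuwa's.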

Category 2: Base Quality

This category focuses on the accuracy of a base sequenced by the sequencing machine, which has considerable impact on variant calling.

This category uses base quality information as features.

Sum of Base Quality (SumBQ)

This feature is the sum of the base quality of effective reads for one candidate variant. For indel reads, this value is set to 30 empirically.

The sum of the base qualities of the reads supporting the variant.

Average Base Quality (AveBQ)

By dividing SumBQ by the number of effective reads, we obtain the average base quality.

The sum of the base qualities of the variant-supporting reads divided by the number of such reads; this mean is used as the feature.
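A short sketch of SumBQ and AveBQ. The quote says the value is "set to 30 empirically" for indel reads; I am assuming here that this means each indel read is treated as having base quality 30, which is one possible reading. The function and constant names are hypothetical.

```python
INDEL_BASE_QUALITY = 30  # assumption: per-read stand-in quality for indel reads


def sum_and_average_bq(base_qualities, is_indel=False):
    """Return (SumBQ, AveBQ) over the effective reads of one candidate variant."""
    if is_indel:
        # Indels have no single base quality, so every read gets the fixed value.
        base_qualities = [INDEL_BASE_QUALITY] * len(base_qualities)
    if not base_qualities:
        return 0, 0.0
    total = sum(base_qualities)
    return total, total / len(base_qualities)


print(sum_and_average_bq([30, 20, 10]))
print(sum_and_average_bq([0, 0], is_indel=True))
```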

Variance of Position (VarPos)

Here, “position” means the offset of the pile-up site from the 3′ end of a read. We use this statistic considering that, generally, sequencing quality declines towards the end of a read; thus, candidate variants that are close to the 3′ end are more likely to be sequencing errors.

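In general, for instrument-related reasons, bases toward the 3′ end of a read are more likely to be wrong, so this feature is built from the distance between the variant position and the 3′ end of each read (its variance across reads, as the name indicates).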
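As an illustration, the offsets and their variance could be computed like this. The paper does not say whether it uses the population or the sample variance, so the choice of `statistics.pvariance` is my assumption, as are the function name and the (read length, index) input layout.

```python
from statistics import pvariance


def var_pos(read_lengths_and_indices):
    """Variance of the site's offset from the 3' end over the effective reads.

    Each item is (read_length, zero_based_index_of_site_in_read), with the
    index counted in sequencing direction (5' to 3').
    """
    offsets = [length - 1 - index for length, index in read_lengths_and_indices]
    return pvariance(offsets) if offsets else 0.0


spread = var_pos([(100, 10), (100, 50), (100, 90)])     # site at varied positions
clustered = var_pos([(100, 97), (100, 98), (100, 99)])  # site always near the 3' end
print(spread, clustered)
```

A candidate whose supporting reads all hit it at nearly the same offset gives a small variance, while reads that cover it at many different offsets give a large one.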

Category 3: Mapping/Alignment Quality

This category considers how well a read is mapped and aligned to its current locus. Mismatches lead to a higher possibility of false positives.

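The third category relates to mapping quality. In general, the worse the mapping quality, the more likely a false positive.
For how mapping quality (MQ) is computed, see here.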

Average Mapping Quality (AveMQ)

The average of the mapping quality of effective reads at the candidate variant’s locus.

The mean mapping quality of the reads supporting the variant.

Worst Mapping Quality (WorMQ)

The worst mapping quality of all reads at the candidate variant’s locus.

The lowest mapping quality among the reads at the candidate variant's position.

Poor Mapping Quality Ratio (PoorMQR)

The ratio of reads with mapping quality lower than 15 at the candidate variant’s locus.

The fraction of reads at the candidate variant's position whose MQ is below 15.
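The three mapping-quality features above can be sketched together. Following the quoted definitions, AveMQ is taken over the effective reads, while WorMQ and PoorMQR are taken over all reads at the locus. The function name and the return values for empty input are my assumptions.

```python
def mapping_quality_features(all_mapqs, effective_mapqs, poor_threshold=15):
    """Return (AveMQ, WorMQ, PoorMQR) for one candidate variant."""
    ave_mq = sum(effective_mapqs) / len(effective_mapqs) if effective_mapqs else 0.0
    wor_mq = min(all_mapqs) if all_mapqs else 0
    if all_mapqs:
        poor_mqr = sum(1 for q in all_mapqs if q < poor_threshold) / len(all_mapqs)
    else:
        poor_mqr = 0.0
    return ave_mq, wor_mq, poor_mqr


# Five reads cover the locus; two of them (MQ 60 and 40) support the candidate.
print(mapping_quality_features([60, 60, 40, 10, 0], [60, 40]))
```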

Average Alignment Score (AveAS)

The alignment score is a different metric than mapping quality, and its computing methods vary from aligner to aligner. Briefly speaking, the alignment score measures the similarity between a read and the reference genome, while mapping quality reflects the specificity that a read tends to be mapped to its current locus instead of other loci. AveAS is the average of the alignment scores of all reads at the candidate variant’s locus.

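Alignment score is a different concept from mapping quality. MQ is a probability-like measure, computed from the base qualities of the mismatched bases of a read aligned at its current position (see the wiki entry on MQ for details), whereas the alignment score measures how similar the read is to the reference genome.
AveAS is the mean alignment score of all reads at the candidate variant's position.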

Category 4: Strand Bias

This category assumes that effective reads of true positives from positive and negative strands of DNA should be approximately equal.

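This category assumes that reads supporting a variant count equally whether they come from the forward or the reverse strand, so there should be no preference for either. Sequencing should also sample the two strands with equal probability.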

Variance of Strands (VarStr)

Assuming that the numbers of effective reads from positive/negative strands obey the binomial distribution, the variance can be calculated through the formula D(n) = np(1-p). If VarStr is small, it means that reads of the candidate variant cluster in one direction, suggesting a sequencing error or other false positive situations.

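Assume sequencing does not distinguish between the forward and reverse strands, so each has probability p = 0.5. In D(n) = np(1-p), p is the probability of the forward strand (or equivalently the reverse strand, since the two are mutually exclusive) and D(n) is the variance of the reads supporting a given variant. D(n) is clearly largest when the two strands are equally likely (p = 0.5), which indicates no instrument-induced strand difference. When one strand clearly dominates, D(n) becomes very small; this is taken as strand bias, and the variant is more likely a false positive.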
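A quick numeric check of D(n) = np(1-p). I am assuming that p is estimated as the observed fraction of effective reads on the forward strand and that n is the total number of effective reads; the quote does not spell this out, and the function name is hypothetical.

```python
def var_str(forward_reads: int, reverse_reads: int) -> float:
    """Binomial variance n * p * (1 - p) with p estimated from the strand counts."""
    n = forward_reads + reverse_reads
    if n == 0:
        return 0.0
    p = forward_reads / n
    return n * p * (1 - p)


print(var_str(10, 10))  # balanced strands: largest variance for n = 20
print(var_str(19, 1))   # almost all reads on one strand: much smaller variance
print(var_str(20, 0))   # all reads on one strand: zero
```

With 20 effective reads, the balanced case gives 5.0 and the lopsided cases fall toward 0, which is the pattern the feature uses to flag strand bias.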

Bias of Strands (BiasStr)

BiasStr is a χ² value measuring the significance of correlation between “whether a read is effective” and the direction of strand that the read comes from. It is calculated by using a 2 × 2 contingency table.

[Figures: the 2 × 2 contingency table with cell counts a, b, c, d, and the χ² formula for BiasStr]

where n = a + b + c + d.
If BiasStr is too high, which means the effective reads of the candidate variant cluster in one strand, the candidate tends to be caused by sequencing error.

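This feature uses a chi-square test to check the association between whether a read is an “effective” read and whether it comes from the forward or reverse strand. A large χ² value indicates that the variant is more likely a false positive.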
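For reference, here is a sketch using the standard Pearson χ² statistic for a 2 × 2 table without continuity correction, χ² = n(ad - bc)² / ((a + b)(c + d)(a + c)(b + d)) with n = a + b + c + d. I am assuming that this is the formula the paper uses and that the cells are laid out as a = effective/forward, b = effective/reverse, c = non-effective/forward, d = non-effective/reverse; neither is confirmed by the text above.

```python
def bias_str(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for a 2 x 2 table (assumed layout, no correction).

    a: effective reads, forward strand      b: effective reads, reverse strand
    c: non-effective reads, forward strand  d: non-effective reads, reverse strand
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty; assumption: treat as no measurable bias.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


print(bias_str(10, 10, 10, 10))  # effective reads split evenly across strands
print(bias_str(20, 0, 10, 10))   # effective reads all on the forward strand
```

The evenly split table gives 0, while the table with all effective reads on one strand gives a value above 13, so a high BiasStr points to strand-clustered support.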

References:
A study on fast calling variants from next generation sequencing data using decision tree.
GATK4.0和全基因组数据分析实践（下）