SNV calling tool: fuwa (using a decision tree)

fuwa is a tool for SNV/indel calling that uses a decision tree (CART) algorithm.
This post only looks at the features the paper uses for classification.


The paper selects 12 features for training and prediction. They fall into four categories: read depth, base quality, mapping/alignment quality, and strand bias.

Category 1: Read Depth

Features under this category measure the absolute depth and depth ratio of reads that are “effective” to be a specific candidate variant. “Effective” means that the read shares the same base as the candidate variant at the candidate’s locus.
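In other words, these features measure the sequencing depth at candidate variant positions. An "effective" read is one that carries the candidate variant's base, and so differs from the reference genome at that position. Since the goal is to detect variants, the depth of reads that differ from the reference is the more informative statistic.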

This category contains three features.

effective base depth

Effective Base Depth (EBD) is the sum of the depths of effective reads. For indel reads, the EBD equals the mapping quality, while for SNV reads, the EBD is the value of the mapping quality multiplied by the base quality.

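From the name I expected this to be the plain depth of reads supporting the variant, but it is actually derived from the mapping quality and the base quality.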
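A minimal sketch of how EBD could be accumulated from per-read qualities. The excerpt does not say whether Fuwa multiplies raw Phred scores or probabilities; this sketch assumes each Phred score is first converted to a probability of being correct, so that every read contributes at most 1 to the "depth". The read layout (`mapq`, `baseq`, `is_indel`) and the function names are hypothetical.

```python
def phred_to_prob(q: float) -> float:
    """Probability that a Phred-scaled call is correct: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def effective_base_depth(effective_reads):
    """Sum per-read contributions over reads carrying the candidate allele.

    Each read is a dict with 'mapq', 'baseq' and 'is_indel' (hypothetical layout).
    Indel reads contribute their mapping quality only; SNV reads contribute
    mapping quality times base quality. Using probabilities instead of raw
    Phred scores is an assumption of this sketch.
    """
    total = 0.0
    for read in effective_reads:
        mq = phred_to_prob(read["mapq"])
        if read["is_indel"]:
            total += mq
        else:
            total += mq * phred_to_prob(read["baseq"])
    return total
```

With this convention, two high-quality effective reads give an EBD just under 2, which is close to the intuitive "depth" reading of the name.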

effective base depth ratio

The EBD ratio, i.e., the EBD of one candidate variant divided by the sum of the EBDs of all candidate variants at that locus. If this indicator is very low, the related candidate variant tends to be a random error.

This feature is mainly relevant to multi-allelic sites: it is the EBD of one candidate allele divided by the total EBD of all candidate alleles at that locus.
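The ratio itself is a simple normalization. The sketch below assumes the per-allele EBD values have already been computed and are passed in as a dictionary; the function name and the zero-total handling are illustrative choices, not taken from the paper.

```python
def ebd_ratios(ebd_by_allele):
    """Divide each candidate allele's EBD by the total EBD at the locus."""
    total = sum(ebd_by_allele.values())
    if total == 0:
        # No effective evidence at all: report 0 for every candidate.
        return {allele: 0.0 for allele in ebd_by_allele}
    return {allele: ebd / total for allele, ebd in ebd_by_allele.items()}
```

For example, a locus where allele A has an EBD of 18 and allele T has an EBD of 2 gives T a ratio of 0.1, which would mark T as the more likely random error.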

DeltaL

DeltaL is a statistic describing the difference between optimal and suboptimal genotypes. Fuwa first hypothesizes that the variant is true, so the reads covering this locus obey an almost ideal variant model: 0/1 or 1/1. The logarithms of likelihood under these two ideal models are calculated separately, and the bigger one is selected as L1. Then, Fuwa calculates the second likelihood logarithm, L2, under another hypothesis that the variant is false and that reads covering this locus follow the binomial distribution model. Thus, L1-L2, or DeltaL, is the logarithm of the ratio of the first and second likelihoods. If DeltaL is close to 0, which means the likelihoods of the ideal model and the binomial model are nearly equal, we empirically judged the variant to be false positive; otherwise, the variant tends to be true.

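I did not fully follow how this is computed. Roughly, it is a likelihood comparison used to judge which candidate genotypes are likely to be errors, i.e. false positives.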
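To make the idea concrete, here is one possible reading of the description as code. It is a sketch under strong assumptions, not Fuwa's actual formula: the "ideal" 0/1 and 1/1 models are taken to be binomials with variant-read fractions 0.5 and 1 - error_rate, the "variant is false" model is a binomial with fraction error_rate, and the default `error_rate` of 0.01 is invented for illustration.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


def delta_l(effective: int, depth: int, error_rate: float = 0.01) -> float:
    """L1 - L2 for `effective` variant-supporting reads out of `depth` reads."""
    # Hypothesis "variant is true": ideal heterozygous (0/1) or homozygous (1/1).
    log_het = log_binom_pmf(effective, depth, 0.5)
    log_hom = log_binom_pmf(effective, depth, 1.0 - error_rate)
    l1 = max(log_het, log_hom)
    # Hypothesis "variant is false": supporting reads arise only from errors.
    l2 = log_binom_pmf(effective, depth, error_rate)
    return l1 - l2
```

Under these assumptions, a site with 10 supporting reads out of 20 gets a large positive value, while a site with a single supporting read out of 20 does not, which matches the direction of the paper's description even if the exact scale differs.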

Category 2: Base Quality

This category focuses on the accuracy of a base sequenced by the sequencing machine, which has considerable impact on variant calling.

These features are built from base quality information.

Sum of Base Quality (SumBQ)

This feature is the sum of the base quality of effective reads for one candidate variant. For indel reads, this value is set to 30 empirically.

The sum of the base qualities of the reads supporting the variant.

Average Base Quality (AveBQ)

By dividing SumBQ by the number of effective reads, we obtain the average base quality.

That is, the summed base quality of the variant-supporting reads divided by the number of such reads; this mean is used as the feature.
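SumBQ and AveBQ can be sketched together. The code below assumes the fixed value of 30 applies per indel read (the excerpt could also be read as applying to the sum); the tuple layout and function name are hypothetical.

```python
INDEL_BASE_QUALITY = 30  # empirical value quoted in the paper excerpt


def sum_and_average_bq(effective_reads):
    """Return (SumBQ, AveBQ) for a list of (base_quality, is_indel) tuples.

    Assumption: each indel read contributes the fixed quality of 30.
    """
    if not effective_reads:
        return 0, 0.0
    sum_bq = sum(
        INDEL_BASE_QUALITY if is_indel else bq
        for bq, is_indel in effective_reads
    )
    return sum_bq, sum_bq / len(effective_reads)
```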

Variance of Position (VarPos)

Here, “position” means the offset of the pile-up site from the 3′end of a read. We use this statistic considering that, generally, sequencing quality declines towards the end of a read; thus, candidate variants that are close to the 3′ end are more likely to be sequencing errors.

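Sequencing errors are generally more likely toward the 3′ end of a read, so this feature is built from the offset of the variant site from the 3′ end of each read (as the name says, the variance of those offsets).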
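A small sketch of the statistic, assuming the offsets from the 3′ end have already been extracted for each effective read. Whether Fuwa uses the population or the sample variance is not stated in the excerpt; the population variance is used here as an assumption.

```python
from statistics import pvariance


def var_pos(offsets_from_3prime):
    """Population variance of the variant site's offset from each read's 3' end."""
    if not offsets_from_3prime:
        return 0.0
    return float(pvariance(offsets_from_3prime))
```

If every supporting read places the variant at the same offset, the variance is 0, which is what one would expect when the signal comes from a position-specific artifact rather than from reads tiled across the site.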

Category 3: Mapping/Alignment Quality

This category considers how well a read is mapped and aligned to its current locus. Mismatches lead to a higher possibility of false positives.

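The third category relates to mapping quality: in general, the worse the mapping quality, the more likely a false positive.
For how the mapping quality (MQ) is calculated, see here.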

Average Mapping Quality (AveMQ)

The average of the mapping quality of effective reads at the candidate variant’s locus.

The mean mapping quality of the reads supporting the variant.

Worst Mapping Quality (WorMQ)

The worst mapping quality of all reads at the candidate variant’s locus.

The lowest mapping quality among all reads at the candidate variant's locus.

Poor Mapping Quality Ratio (PoorMQR)

The ratio of reads with mapping quality lower than 15 at the candidate variant’s locus.

The fraction of reads at the candidate variant's locus whose MQ is below 15.
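The three mapping-quality features so far can be sketched in one function. Following the excerpts, AveMQ is taken over effective reads while WorMQ and PoorMQR are taken over all reads at the locus; the function name, argument names and the values returned for empty input are illustrative assumptions.

```python
POOR_MQ_THRESHOLD = 15  # cutoff quoted in the paper excerpt


def mapping_quality_features(effective_mapqs, all_mapqs):
    """Compute AveMQ, WorMQ and PoorMQR from lists of mapping qualities."""
    ave_mq = sum(effective_mapqs) / len(effective_mapqs) if effective_mapqs else 0.0
    wor_mq = min(all_mapqs) if all_mapqs else 0
    if all_mapqs:
        poor = sum(1 for q in all_mapqs if q < POOR_MQ_THRESHOLD)
        poor_mqr = poor / len(all_mapqs)
    else:
        poor_mqr = 0.0
    return {"AveMQ": ave_mq, "WorMQ": wor_mq, "PoorMQR": poor_mqr}
```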

Average Alignment Score (AveAS)

The alignment score is a different metric than mapping quality, and its computing methods vary from aligner to aligner. Briefly speaking, the alignment score measures the similarity between a read and the reference genome, while mapping quality reflects the specificity that a read tends to be mapped to its current locus instead of other loci. AveAS is the average of the alignment scores of all reads at the candidate variant’s locus.

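Alignment score and mapping quality are different concepts. MQ is a probability-like measure derived from the base qualities of the mismatched bases when a read is aligned to its current position (see the wiki entry on MQ for details), whereas the alignment score measures how similar the read is to the reference genome.
AveAS is the mean alignment score of all reads at the candidate variant's locus.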

Category 4: Strand Bias

This category assumes that effective reads of true positives from positive and negative strands of DNA should be approximately equal.

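This category assumes that reads supporting a variant count equally whether they come from the positive or the negative strand, so there should be no preference for one strand. During sequencing, the positive and negative strands should also be sampled with equal probability.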

Variance of Strands (VarStr)

Assuming that the numbers of effective reads from positive/negative strands obey the binomial distribution, the variance can be calculated through the formula D(n) = np(1-p). If VarStr is small, it means that reads of the candidate variant cluster in one direction, suggesting a sequencing error or other false positive situations.

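If sequencing does not distinguish between strands, each strand should be observed with probability p = 0.5. In D(n) = np(1-p), p is the probability that a read comes from the positive strand (or, equivalently, the negative strand, since the two are mutually exclusive), and D(n) is the variance for the reads supporting a given variant. D(n) is largest when both strands are equally likely (p = 0.5), which indicates no strand difference introduced by the machine. When one strand clearly dominates, D(n) becomes very small; this is taken as strand bias, and the variant is more likely a false positive.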
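A sketch of D(n) = np(1-p) in code. It assumes p is estimated from the observed strand counts of the effective reads; with p fixed at 0.5 the value would depend only on depth, so an empirical p seems the more plausible reading, but that is an assumption. The function name is hypothetical.

```python
def var_str(forward: int, reverse: int) -> float:
    """Binomial variance n*p*(1-p) of the strand counts of effective reads."""
    n = forward + reverse
    if n == 0:
        return 0.0
    p = forward / n  # assumption: p is the observed forward-strand fraction
    return n * p * (1.0 - p)
```

With 20 effective reads, a balanced 10/10 split gives 5.0, a 19/1 split gives 0.95, and a 20/0 split gives 0.0, so small values flag reads clustered on one strand.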

Bias of Strands (BiasStr)

BiasStr is a χ² value measuring the significance of correlation between “whether a read is effective” and the direction of strand that the read comes from. It is calculated by using a 2 × 2 contingency table.


[Images: the 2 × 2 contingency table with cell counts a, b, c and d, and the χ² formula]

where n = a + b + c + d.
If BiasStr is too high, which means the effective reads of the candidate variant cluster in one strand, the candidate tends to be caused by sequencing error.

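This feature uses a chi-square test of the association between whether a read is an "effective read" and which strand it comes from. A large χ² value suggests the variant is more likely a false positive.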
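Since the formula image did not survive, the sketch below uses the standard Pearson χ² statistic for a 2 × 2 table, n(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), which is presumably what the paper shows. The cell layout (rows: effective vs. other reads, columns: forward vs. reverse strand) is an assumption, as is returning 0 when a row or column is empty.

```python
def bias_str(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2 x 2 contingency table.

    Assumed layout (not confirmed by the source):
        a = effective reads on the forward strand
        b = effective reads on the reverse strand
        c = other reads on the forward strand
        d = other reads on the reverse strand
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # An empty row or column makes the statistic undefined; report no bias.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

A perfectly balanced table such as (10, 10, 10, 10) gives 0, while a table where all effective reads sit on one strand and all other reads on the opposite strand, such as (20, 0, 0, 20), gives 40.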

References:
A study on fast calling variants from next generation sequencing data using decision tree.
GATK4.0和全基因组数据分析实践(下) (GATK 4.0 and whole-genome data analysis in practice, part 2)
