序列比对其他相关问题

全局比对 Global Alignment

由芝加哥的Needleman和Wunsch两位于上个世纪70年代初提出，常被称之为Needleman-Wunsch算法。算法针对用户指定的打分函数，确定性地找出两条序列间的最优比对。

Needle-Wunsch 算法对两条序列所有残基进行全局比对的局限性。

功能相关的蛋白之间虽然可能在整体序列上相差甚远，却常常会具有相同的功能域
序列片段能够独立发挥特定的生物学功能，却在不同蛋白之间相当保守
仅靠全局比对的算法无法发现这样的片段
内含子的发现使得在做核酸水平的序列比对时必须要正确处理内含子导致的大片段的差异

局部比对 Local Alignment

1981年，物理学家Temple Smith和数学家Michael Waterman在Journal of Molcular Biology上发表了一篇仅有四页的文章，提出Smith-Waterman局部比对算法

race back begins at the highest score inthe matrix and continues until you reach 0。And also the secondary best alignment

img

核心思想是给分数增加了下限0分

所有的回溯都是局部的，所有的最终比对也是局部的。引入止损下限，差异扩大之后重启比对，找到局部水平的相似性

空位罚分的改进

Affine gap penalty

opening a gap receives a score ofd
extending a gap receives a score of e
Penalty = d + (n-1)* e

有限向量机

alt

扩展公式

alt

全局比对的时间复杂性

O(mn) 正比于m*n

全基因组比对

同源homology的分类

直系同源 ortholog 来自于不同的物种，演化过程中基因没有丢失，各物种中都有

chaining
旁系同源 paralog 来自于一个物种内部基因组的复制

netting

NGS: Sequence alignment

Map the large numbers of short reads to a reference genome

In a broader sense: Identify similar sequences (DNA, RNA, or protein) inconsequence of functional, structural, or evolutionary relationships between the them
Applications: Genome assembly, SNP detection, homology search, etc

short: greater search sensitivity

large: faster search speed

In-exact alignment

BWA和bowtie的相关算法，大大减少了对服务器的要求

如何快速的知道某段序列大约在基因组的哪个位置

如何定义大约这个概念

Hamming Distance or Sequence Similarity
Ungapped vs Gapped Global vs Local
All positions or the single best

Efficiency depends on the data characteristics & goals

Smith-Waterman: Exhaustive search for optimal alignments
BLAST: Hash-table based homology searches
Bowtie: BWT alignment for short read mapping

BWT算法

核心是绕最后一个序列转圈

alt

先给每一个T做rotations,再进行sort,生成bw矩阵，最后一列从头到尾就是BWT
回溯

给每一个T的字母一个出现次数的排序,图示如下

alt

Suffix（后缀） Arrays

类似于电话本中的索引结构

What if we need to check many queries

We don't need to check every page of the phone book to find 'Ma'
Sorting alphabetically lets us immediately skip 96% (25/26) of the book withoutany loss in accuracy

Sorting the genome: Suffix Array (Manber & Myers1991)

Sort every suffix of the genome

alt

所有具有相同prefix(前缀)的suffixes(后缀)会聚在一起，这样就可以进行类似于二分法的排除

全基因组建立index

alt

An array of integers giving the startingpositions of suffixes of a string inlexicographical order
从中间的index开始找，过滤一半

alt

效率

Total Runtime: O(m log n)

More complicated, but much faster!
Looking up a query loops 32 times instead of 3B

Searching the array is very fast, but it takes time to construct

This time will be amortized over many, many searches
Run it once "overnight" and save it away for all future queries
非常消耗内存

BLAST/Dot matrix

Indexing-based local alignment

alt

Basic Local Alignment Search Tool

围绕最优比对路径进行计算

BLAST Ideas 核心思想: Seeding‐and‐extending

Find matches (seed) between the query and subject 找到高度相似的小片段，种子
Extend seed into High Scoring Segment Pairs (HSPs) 向两端延伸并进行比对

Run Smith‐Waterman algorithm on the specified region only
Assess the reliability of the alignment 打分评估

将序列切分，在数据库中定位候选序列和位置

alt

得到候选序列和查询序列的heatmap

去掉零散的hit,直留下对角线，形成hit cluster

以hit cluster为基础向左右进行延伸直到分数不符合要去

在扩展的区域进行局部比对

alt

blast加速

标记低复杂度，易产生假阳性
考虑与种子相似的邻居单字

分数评估，避免随机因素

E value

alt

n数据库大小
k和打分矩阵相关
m长度
s比对的分数

在随机情况下获得比当前分数高的可能比对条数，不是概率是个望值。p为0.05时，E也是0.05。

BLAST是一种启发式算法，不确定有最优解

alt

只在有效区域应用动态规划算法

加入靠谱熊基地，和大家一起交流

序列比对其他相关问题