Dragon star Day 2 Pt.1 关于序列比对、计分方法、BLAST原理及判断值、BWT原理

Dragon star Day 2 20190730

关于序列比对、计分方法、BLAST原理及判断值、BWT原理

补丁笔记之终于上手画了
-为什么这么难?
-因为你是指南针呀
-。


Dragonstar2019 by Kai Wang

  1. Alignment of short/long-read sequencing data
  2. Genome assembly by short/long-read sequencing

Part Ⅰ Alignment of short/long-read sequencing data

1 Sequence similarity

Sequence similarity is a measure of an empirical relationship between sequences.

A similarity score is therefore aimed to approximate the evolutionary distance between a pair of nucleotide or protein sequences.

Heringa, Jaap(Jul 2008) Sequence Similarity. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005317.pub2]

2 Sequence alignment

  • Consider matches, mismatches, insertions and deletions
    (indels are gaps)
  • Find an optimal alignment between sequences
  • Applications
    • Reference-based read mapping
    • Genome assembly
    • Gene finding

2.1 Pairwise alignment

Sequence alignment between two sequences

  • Find an optimal alignment between two sequences

3 Dynamic programming

动态规划(英语:Dynamic programming,简称DP)是一种在数学管理科学计算机科学经济学生物信息学中使用的,通过把原问题分解为相对简单的子问题的方式求解复杂问题的方法。

动态规划常常适用于有重叠子问题和最优子结构性质的问题,动态规划方法所耗时间往往远少于朴素解法。

动态规划背后的基本思想非常简单。大致上,若要解一个给定问题,我们需要解其不同部分(即子问题),再根据子问题的解以得出原问题的解。

https://zh.wikipedia.org/wiki/动态规划

  • Purpose
    • To find an alignment between two sequences with best matching scores
  • Three components
    • Recursive calculation - 递归计算
    • Tabular arrangement - 填充得分矩阵
    • Traceback - 追溯本源
  • Three common types of pairwise alignments
    • Global alignment: Needleman-Wunsch
    • Local alignment: Smith-Waterman
    • Semi-global alignment

3.1 Global alignment: Needleman-Wunsch

  • Best global alignment

    Have maximal alignment score with all bases in two sequences

  • Score function

3.1.1 Working of Needleman -Wunsch Algorithm

The dynamic programming matrix is defined with three different steps.

  1. Initialization of the matrix with the scores possible.
  2. Matrix filling with maximum scores.
  3. Trace back the residues for appropriate alignment.

vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

  • Assume we have two sequences: P and Q

    • P = TCATGGC
    • Q = TCATC
  • Initial matrix

  • Initialization Step

    Gap score is defined as penalty given to alignment, when we have insertion or deletion.

    This example assumes that there is gap penalty.

    从 Score function 可知空位罚分为1

    vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

  • Matrix Fill Step

    全部写出来太乱了,就先杰样吧

  • Trace back Step

    The final step in the algorithm is the trace back for the best alignment.

    By continuing the trace back step by the above defined method, one would reach to the 0th row, 0th column.

    The best alignment among the alignments can be identified by using the maximum alignment score which may be user defined.

    vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

    多种比对结果:

    purple line:
    TCATGGC
    ||||--|
    TCAT--C
    
    blue line:
    TCATGGC
    ||||-|-
    TCAT-C-
    
    yellow line:
    TCATGGC
    |||||--
    TCATC--
    

3.2 Local alignment: Smith-Waterman

史密斯-沃特曼算法(Smith-Waterman algorithm)是一种进行局部序列比对(相对于全局比对)的算法,用于找出两个核苷酸序列或蛋白质序列之间的相似区域。该算法的目的不是进行全序列的比对,而是找出两个序列中具有高相似度的片段。

该算法由坦普尔·史密斯(Temple F. Smith)和迈克尔·沃特曼(Michael S. Waterman)于1981年提出。史密斯-沃特曼算法是尼德曼-翁施算法的一个变体,二者都是动态规划算法。这一算法的优势在于可以在给定的打分方法下找出两个序列的最优的局部比对(打分方法使用了置换矩阵和空位罚分)。该算法和尼德曼-翁施算法的主要区别在于该算法不存在负分(负分被替换为零),因此局部比对成为可能。回溯从分数最高的矩阵元素开始,直到遇到分数为零的元素停止。分数最高的局部比对结果在此过程中产生。

https://zh.wikipedia.org/史密斯-沃特曼算法

  • Best local alignment

    Have maximal alignment score with a subset of bases in two sequences

  • Score function

3.2.1 Working of Smith-Waterman Algorithm

The basic steps for the algorithm are similar to that of Needleman-Wunsch algorithm. The steps are:

  1. Initialization of a matrix.
  2. Matrix Filling with the appropriate scores.
  3. Trace back the sequences for a suitable alignment.

vlab.amrita.edu,. (2012). Smith-Waterman Algorithm - Local Alignment of Sequences. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1

可见步骤都是一致的,用同样的序列和 score function 进行比对

  • Assume we have two sequences: P and Q

    • P = TCATGGC
    • Q = TCATC
  • Initial matrix

  • Initialization Step

  • Matrix Fill Step

  • Trace back the sequences for a suitable alignment

    1. Find a best score recursively
    2. Check which operation is used to obtain the current alignment score
    3. Stop when an alignment is 0

    所以最后应该是这个亚子:

    TCAT
    ||||
    TCAT
    

4 BLAST: The seed-index-map-extend-merge strategy

The BLAST = Basic Local Alignment Search Tool algorithm is a heuristic for computing optimal "local alignments" between a query sequence and a database containing one or more subject sequences.

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf

4.1 BLAST Basics

  • BLAST is used for sequence similarity search, and much faster than Smith-Waterman method.
  • Compare a query sequence against a database of sequences.
  • BLAST is a collection of algorithms
    • BLASTN: nucleotide sequence against nucleotide database
    • BLASTP: protein sequence against protein database
    • BLASTX: translated nucleotide sequence against protein database
    • TBLASTX: six-frame translations of a nucleotide query sequence
      against the six-frame translations of a nucleotide sequence database
    • TBLASTN: protein sequence against translated nucleotide database

4.2 Process: The seed-index-map-extend-merge strategy

Most popular algorithms use a seed-and-extend approach that operates in two steps:

  1. Find a set of short exact matches called seeds.
  2. Try to extend each seed match to obtain a long inexact match

These matching words are called seeds and BLAST then attempts to extend each seed to a HSP (High-Scoring Segment Pair).

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

4.2.1 The Process
  • Seed and index

    • Construct common words (k-mers) for sequences in a database

    • Assume the database has two sequences and k=3

      A “k-mer” is a word of DNA that is k long.

      Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences.

      k-mers are most useful when they’re long, because then they’re specific.

      The important concept here is that long k-mers are species specific.

      https://angus.readthedocs.io/en/2017/kmers-and-sourmash.html

  • Assume all 3-mers are high-score words

    BLAST needs a threshold to determine high-score words.

  • Seed-map with high-score words

    Obtain words from a query sequence

  • Extend-merge

    • Extend: high-scoring segment

    • BLAST output: 2 alignment containing match information and E-value for each alignment

      E-value 的解释见 4.2.2

4.2.2 Statistical significance of alignment
  • How to assess the significance of a high-scoring hit to the database?

    One crucial feature of the BLAST algorithm is that it gives an estimation of the statistical signficance of a determined HSP. This is based on so-called Karlin-Altschul statistics for local alignments (without gaps), which involves the Poisson distribution and other concepts.

    即存在 E-value, P-value

    https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

    E is the number of alignments expected by chance during a database search.

  • Karlin-Altschul equation: E(S) = Kmne^{-λS}

    • Score (S) from scoring matrix
    • 𝐾 and λ are two constants
    • 𝑚: the length of a query sequence
    • 𝑛: the sum of bases of all sequences in database
    • The lower E value indicates more significant alignment.

    E value represents the number of matches with score ≥ S that one would expect to see between two random sequences of lengths n and m.

    https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

  • How to assess the P-value of finding a high-scoring pair (HSP)?

    • The number of random HSPs with score >= S is described by a Poisson distribution

    • The probability of finding exactly k HSPs with score >=S is given by P(X = k) = e^{−E} *E^k/k!

    • The probability of finding at least one HSP by chance is

      P(S) = 1 − P(X = 0) = 1 − e^{−E}

  • E-value vs P-value

    • E = − ln(1 − P)
    • When E and P are very small, they are almost identical

5 BWT

Technically, it is a lexicographical reversible permutation of the characters of a string.

The most important application of BWT is found in biological sciences where genomes(long strings written in A, C, T, G alphabets) don’t have many runs but they do have many repeats.

https://www.geeksforgeeks.org/burrows-wheeler-data-transform-algorithm/

5.1 Bowtie/BWA

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

http://bowtie-bio.sourceforge.net/index.shtml

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

http://bio-bwa.sourceforge.net/

BWA and Bowtie are the most famous implementations of BWT in sequence alignment.

5.2 BWT procedures

  • Given a string S = “TCATC”, BWT

    • Adds $ and assume $<any alphabet 在最后加上$
    • Obtains a suffix array of all cyclic rotations 轮转后缀矩阵
  • Cyclic rotation

    • Keeps an index of the rotated strings in the array 前面的数字
    • Creates Circular Permutation Table (CPT) 表格本身
  • Sorting

    • Sorts the Circular Permutation Table (CPT) alphabetically
    • Keep the index with the strings 数字不变,横向为单位
    • Taking the last column from the sorted array
  • BWT output

    • The last column represents the transformed string

5.3 BWT reverse transformation

With a suffix array, it is easy to recover the original string S


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,402评论 6 499
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,377评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 162,483评论 0 353
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,165评论 1 292
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,176评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,146评论 1 297
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,032评论 3 417
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,896评论 0 274
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,311评论 1 310
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,536评论 2 332
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,696评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,413评论 5 343
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,008评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,659评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,815评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,698评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,592评论 2 353

推荐阅读更多精彩内容