笔记首先总结概括HMM相关理论知识,再通过生信序列的R语言实操对HMM深入了解
1、MM,Markov model 马尔科夫模型
(1)天气举例
- 假设只有晴、阴、雨三种天气情况。某地周一为晴天,已知若晴天后,第二天出现晴阴雨的概率分别为0.5,0.3,0.2(sum=1)。
- 例如周二为雨天的概率即为0.2。而已知若雨天后,第二天出现晴阴雨的概率分别为0.1,0.6,0.3(sum=1),可预测周三为阴天的概率为0.2*0.3=0.06。
- 继续预测,已知若阴天后,第二天出现晴阴雨的概率分别为0.3,0.4,0.3(sum=1),可预测周四出现晴天的概率为0.20.30.4=0.024;
- 综上在已知假设及周一天气情况下,周一~周四天气为晴阴雨晴的概率为0.024,还可以继续预测接下来一周的天气情况概率。
(2)Markov chain马尔可夫链
- 如上假设,即为一个简单的马尔科夫链。首先来说,它是一个离散性质的时序模型,即在顺序离散过程中,一个时间点(周几)代表一种特定的状态(天气)。
-
其次马尔科夫链最重要的假设是:当前的状态只和上一时刻有关,与上一时刻之前的任何状态都没有关系。反映到例子中就是今天的天气情况只取决于昨天的天气情况,与前天,上周的天气都无关。
反映到如下公式中,t+1时间点的状态可能分布只与t时间点的状态有关。
(3)Transition Matrix状态转移概率矩阵!!(划重点)
- stata状态:马尔可夫链在状态空间内的取值称为状态,所有的取值情况称为状态空间 stata space。简单理解就是例子中的三种天气(晴阴雨);
-
状态转移概率矩阵即是指由一种特定状态分别转移到所有状态的概率transition probability(和必然为1),假设存在n种状态,则Transition Matrix就是n*n的矩阵。γij的值就表示状态 i 转移到状态 j 的概率。
-
Transition Matrix的图形展示
(4)initial probabilities of the states
- 上述例子中是给定周一天气为晴天,更多情况下马尔科夫链的开端也是状态的概率分布π;
-
Often, the initial probabilities of the states πi are assumed to be the stationary distribution implied by the transition probability matrix。
即趋近于无穷时刻下,三种状态的分布的概率。
2、R代码实操:马尔科夫模型预测DNA序列
(1)已知条件
A:特殊的序列
For some DNA sequences, the probability of finding a particular nucleotide at a particular position in the sequence does depend on what nucleotides are found at adjacent positions in the sequence.
即符合马尔可夫链的假设
B:转移矩阵
C:初始概率
可任意指定,例如T 0.3 C 0.3 G 0.2 A 0.2
(2)R实操
目的:主要根据转移矩阵预测可能的这样一段特殊序列
- 制作TM
nucleotides <- c("A", "C", "G", "T")
afterAprobs <- c(0.2, 0.3, 0.3, 0.2) # Set the values of the probabilities, where the previous nucleotide was "A"
afterCprobs <- c(0.1, 0.41, 0.39, 0.1) # Set the values of the probabilities, where the previous nucleotide was "C"
afterGprobs <- c(0.25, 0.25, 0.25, 0.25) # Set the values of the probabilities, where the previous nucleotide was "G"
afterTprobs <- c(0.5, 0.17, 0.17, 0.17) # Set the values of the probabilities, where the previous nucleotide was "T"
mytransitionmatrix <- matrix(c(afterAprobs, afterCprobs, afterGprobs, afterTprobs), 4, 4, byrow = TRUE) # Create a 4 x 4 matrix
rownames(mytransitionmatrix) <- nucleotides
colnames(mytransitionmatrix) <- nucleotides
mytransitionmatrix
- 编写函数
主要基于第一个起始碱基的概率与TM概率分布,利用sample()抽样函数进行实现序列预测
generatemarkovseq <- function(transitionmatrix, initialprobs, seqlength)
{
nucleotides <- c("A", "C", "G", "T") # Define the alphabet of nucleotides
mysequence <- character() # Create a vector for storing the new sequence
# Choose the nucleotide for the first position in the sequence:
firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=initialprobs)
mysequence[1] <- firstnucleotide # Store the nucleotide for the first position of the sequence
for (i in 2:seqlength)
{
prevnucleotide <- mysequence[i-1] # Get the previous nucleotide in the new sequence
# Get the probabilities of the current nucleotide, given previous nucleotide "prevnucleotide":
probabilities <- transitionmatrix[prevnucleotide,]
# Choose the nucleotide at the current position of the sequence:
nucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
mysequence[i] <- nucleotide # Store the nucleotide for the current position of the sequence
}
return(mysequence)
}
- 运行函数
myinitialprobs <- c(0.3, 0.3, 0.2, 0.2) #对应函数体设置的"A", "C", "G", "T"顺序
generatemarkovseq(mytransitionmatrix, myinitialprobs, 30)
table(generatemarkovseq(mytransitionmatrix, myinitialprobs, 30))
参考文章
1、Hidden Markov Models — Bioinformatics 0.1 documentation
2、01 隐马尔可夫模型 - 马尔可夫链、HMM参数和性质 - 简书
3、Multilevel HMM tutorial
4、马尔可夫链_百度百科