参考1:https://sites.google.com/site/zjuwhwsblog/blog/chip-seqshujudemotiffenxizhengli
参考2:https://davetang.org/muse/2013/10/01/position-weight-matrix/
1.首先考虑如何由PFM转成PWM
- I.表示motif 的
Matrix-based
(矩阵方法),就是利用矩阵将每个位置的A,G,C,T的量都表示出来。该方法又有三种变化,Count-matrix
,PFM
(position frequency matrix)和PWM
(position weight scoring)。Count matirx
是每个位置计数得来的,PFM
是每个位置的百分比得来的,而PWM
是通过取对数得来的。
- II.在
PFM
到PWM
的转换过程中,有两个问题, - 一是需要对背景的校正,因为基因组上的GC含量与AT含量不同;
- 二是为了避免有
count
是0的情况,没法取log,因此经常使用pseudocount来解决这个计算过程中的问题(不同软件pseudocount的方法不一样,具体见http://www.people.vcu.edu/~elhaij/IntroBioinf/Notes/PSSM.pdf)。
采用伪计数方法【+sqrt(total)
】,避免count=0,无法计算问题,输出的8个数字,表示总数为8,【注:8 为列和】
出现0
次权重-1.936752
,出现1
次权重为-0.6651985
,依次...
- 公式如下:
# total 表示为列和
pwm <- function(freq, total, bg=0.25){
#using the formulae above
p <- (freq + (sqrt(total) * 1/4)) / (total + (4 * (sqrt(total) * 1/4)))
log2(p/bg)
}
#work out all possible PWM values
for (i in 0:8){
print(pwm(i,8))
}
[1] -1.936752
[1] -0.6651985
[1] 0
[1] 0.4535419
[1] 0.7980888
[1] 1.076008
[1] 1.308939
[1] 1.509438
[1] 1.685442
- 计算PWM
a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
m <- matrix(data=c(a,c,g,t),nrow=4,byrow=T,dimnames=list(c('a','c','g','t')))
m
mm <- pwm(m,8)
mm
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
a -1.9367518 0.7980888 0.7980888 -1.936752 0.4535419 1.5094376 0.7980888 0.4535419 1.0760078
c 0.4535419 -1.9367518 0.7980888 1.685442 -1.9367518 -1.9367518 -1.9367518 0.4535419 -1.9367518
g 0.0000000 0.4535419 -1.9367518 -1.936752 -1.9367518 -1.9367518 -1.9367518 -1.9367518 -0.6651985
t 0.4535419 -0.6651985 -1.9367518 -1.936752 1.0760078 -0.6651985 0.7980888 0.0000000 0.0000000
[,10] [,11] [,12] [,13] [,14]
a 0.7980888 0.000000 -1.936752 -1.9367518 0.7980888
c -1.9367518 -1.936752 -1.936752 0.0000000 0.7980888
g -1.9367518 1.308939 1.685442 1.0760078 -1.9367518
t 0.7980888 -1.936752 -1.936752 -0.6651985 -1.9367518
- 现在我们有了PWM,我们可以为任何DNA序列(相同长度)生成一个定量的分数,方法是将与每个位置观察到的核苷酸相对应的值相加。
seq <- 'ttacataagtagtc'
x <- strsplit(x=seq,split='')
#initialise vector
seq_score <- vector()
#get the corresponding values
for (i in 1:14){
seq_score[i] <- mm[x[[1]][i],i]
}
seq_score
[1] 0.4535419 -0.6651985 0.7980888 1.6854416 0.4535419 -0.6651985 0.7980888 0.4535419 -0.6651985 0.7980888
[11] 0.0000000 1.6854416 -0.6651985 0.7980888
sum(seq_score)
[1] 5.26307
sum(apply(mm,2,max))
[1] 14.31481
或写成函数
seq <- 'aaaacaaagttca'
Cal_score<-function(seq,mm) {
seq<-tolower(seq)
x <- strsplit(x=seq,split='')
seq_score <- vector()
for (i in 1:dim(mm)[2]){
seq_score[i] <- mm[x[[1]][i],i]
}
sum(seq_score)
}
Cal_score(seq,mm)
2.如何从PFM矩阵画seqlogo
-
motif的logo图,经常用来展示motif,其纵坐标是表示每个位置信息熵的大小,经过背景矫正,其公式如下
library(seqLogo)
a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
df <- data.frame(a,c,g,t)
#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
rs <- sum(x);
return(x / rs);
}
#create position weight matrix
mef2 <- apply(df, 1, proportion)
mef2 <- makePWM(mef2)
seqLogo(mef2)
附:对0值进行处理
pwm <- function(freq, total, bg=0.25){
#using the formulae above
p <- (freq + (sqrt(total) * 1/4)) / (total + (4 * (sqrt(total) * 1/4)))
log2(p/bg)
}
Cal_score<-function(seq,mm){
seq<-toupper(seq)
x <- strsplit(x=seq,split='')
seq_score <- vector()
for (i in 1:(dim(mm)[2]) ){
seq_score[i] <- as.numeric(mm[x[[1]][i],i])
}
sum(seq_score)
}
### 对Matrix 进行计算
A <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
C <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
G <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
T <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
m <- matrix(data=c(a,c,g,t),nrow=4,byrow=T,dimnames=list(c('A','C','G','T')))
mm<-pwm(m,unique(colSums(m)))
seq <- 'ttacataagtagtc'
Cal_score(seq,mm)