PFM-PWM-Seqlogo

参考1：https://sites.google.com/site/zjuwhwsblog/blog/chip-seqshujudemotiffenxizhengli
参考2：https://davetang.org/muse/2013/10/01/position-weight-matrix/

1.详解motif的PWM矩阵

1.首先考虑如何由PFM转成PWM

image.png

I.表示motif 的Matrix-based（矩阵方法），就是利用矩阵将每个位置的A，G，C，T的量都表示出来。该方法又有三种变化，Count-matrix，PFM（position frequency matrix）和PWM（position weight scoring）。Count matirx是每个位置计数得来的，PFM是每个位置的百分比得来的，而PWM是通过取对数得来的。
II.在PFM到PWM的转换过程中，有两个问题，
一是需要对背景的校正，因为基因组上的GC含量与AT含量不同；
二是为了避免有count是0的情况，没法取log，因此经常使用pseudocount来解决这个计算过程中的问题（不同软件pseudocount的方法不一样，具体见http://www.people.vcu.edu/~elhaij/IntroBioinf/Notes/PSSM.pdf）。

采用伪计数方法【+sqrt(total)】，避免count=0,无法计算问题,输出的8个数字，表示总数为8，【注：8 为列和】
出现0次权重-1.936752，出现1次权重为-0.6651985，依次...

公式如下：

# total 表示为列和
pwm <- function(freq, total, bg=0.25){
  #using the formulae above
  p <- (freq + (sqrt(total) * 1/4)) / (total + (4 * (sqrt(total) * 1/4)))
  log2(p/bg)
}
#work out all possible PWM values
for (i in 0:8){
  print(pwm(i,8))
}

[1] -1.936752
[1] -0.6651985
[1] 0
[1] 0.4535419
[1] 0.7980888
[1] 1.076008
[1] 1.308939
[1] 1.509438
[1] 1.685442

计算PWM

a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
m <- matrix(data=c(a,c,g,t),nrow=4,byrow=T,dimnames=list(c('a','c','g','t')))
m

mm <- pwm(m,8)
mm
        [,1]       [,2]       [,3]      [,4]       [,5]       [,6]       [,7]       [,8]       [,9]
a -1.9367518  0.7980888  0.7980888 -1.936752  0.4535419  1.5094376  0.7980888  0.4535419  1.0760078
c  0.4535419 -1.9367518  0.7980888  1.685442 -1.9367518 -1.9367518 -1.9367518  0.4535419 -1.9367518
g  0.0000000  0.4535419 -1.9367518 -1.936752 -1.9367518 -1.9367518 -1.9367518 -1.9367518 -0.6651985
t  0.4535419 -0.6651985 -1.9367518 -1.936752  1.0760078 -0.6651985  0.7980888  0.0000000  0.0000000
       [,10]     [,11]     [,12]      [,13]      [,14]
a  0.7980888  0.000000 -1.936752 -1.9367518  0.7980888
c -1.9367518 -1.936752 -1.936752  0.0000000  0.7980888
g -1.9367518  1.308939  1.685442  1.0760078 -1.9367518
t  0.7980888 -1.936752 -1.936752 -0.6651985 -1.9367518

现在我们有了PWM，我们可以为任何DNA序列(相同长度)生成一个定量的分数，方法是将与每个位置观察到的核苷酸相对应的值相加。

seq <- 'ttacataagtagtc'
x <- strsplit(x=seq,split='')


#initialise vector
seq_score <- vector()
#get the corresponding values
for (i in 1:14){
  seq_score[i] <- mm[x[[1]][i],i]
}

seq_score
[1]  0.4535419 -0.6651985  0.7980888  1.6854416  0.4535419 -0.6651985  0.7980888  0.4535419 -0.6651985  0.7980888
[11]  0.0000000  1.6854416 -0.6651985  0.7980888

sum(seq_score)
[1] 5.26307

sum(apply(mm,2,max))
[1] 14.31481

或写成函数

seq <- 'aaaacaaagttca'
Cal_score<-function(seq,mm) {
  seq<-tolower(seq)
  x <- strsplit(x=seq,split='')
  seq_score <- vector()
  for (i in 1:dim(mm)[2]){
    seq_score[i] <- mm[x[[1]][i],i]
  }
  sum(seq_score)
}

Cal_score(seq,mm)

2.如何从PFM矩阵画seqlogo

motif的logo图，经常用来展示motif，其纵坐标是表示每个位置信息熵的大小，经过背景矫正，其公式如下

image.png

library(seqLogo)
a <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
c <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
g <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
t <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
df <- data.frame(a,c,g,t)

#define function that divides the frequency by the row sum i.e. proportions
proportion <- function(x){
    rs <- sum(x);
    return(x / rs);
}


#create position weight matrix
mef2 <- apply(df, 1, proportion)
mef2 <- makePWM(mef2)
seqLogo(mef2)

image.png

附：对0值进行处理

pwm <- function(freq, total, bg=0.25){
  #using the formulae above
  p <- (freq + (sqrt(total) * 1/4)) / (total + (4 * (sqrt(total) * 1/4)))
  log2(p/bg)
}


Cal_score<-function(seq,mm){
  seq<-toupper(seq)
  x <- strsplit(x=seq,split='')
  seq_score <- vector()
  for (i in 1:(dim(mm)[2]) ){
    seq_score[i] <- as.numeric(mm[x[[1]][i],i])
  }
  sum(seq_score)
}

### 对Matrix 进行计算
A <- c(0, 4, 4, 0, 3, 7, 4, 3, 5, 4, 2, 0, 0, 4)
C <- c(3, 0, 4, 8, 0, 0, 0, 3, 0, 0, 0, 0, 2, 4)
G <- c(2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 6, 8, 5, 0)
T <- c(3, 1, 0, 0, 5, 1, 4, 2, 2, 4, 0, 0, 1, 0)
m <- matrix(data=c(a,c,g,t),nrow=4,byrow=T,dimnames=list(c('A','C','G','T')))

mm<-pwm(m,unique(colSums(m)))

seq <- 'ttacataagtagtc'
Cal_score(seq,mm)

PFM-PWM-Seqlogo

1.首先考虑如何由PFM转成PWM

2.如何从PFM矩阵画seqlogo

附：对0值进行处理

推荐阅读更多精彩内容