seqlogo
图常用于展示特定为区域的序列信息,就像这样:
之前很好奇这种图是怎么画出来的,后面知道了一个R包:ggseqlogo
。提供了一系列的可视化方法:
作者也提供了完整的教程:https://omarwagih.github.io/ggseqlogo/。
这种图,重要的是理解数据结构,然后就可以用在自己的数据上了。本文的示例数据在公众号PLANTOMIX
后台回复seqlogo
即可获取。
DNA序列
有两种方法,一种是按照Bits
进行展示,另外一种是以prob
(比例)进行展示。直接将数据放在数据框里面即可:
require(ggplot2)
require(ggseqlogo)
library(stringr)
library(ggsci)
library(tidyverse)
# DNA序列
seq_dna = read.table('data/test.DNA.seq.txt', header = T)
p1 = ggseqlogo(as.character(seq_dna$test.seq), method = 'prob') +
theme_bw() +
scale_y_continuous(labels = scales::percent)
p1
ggsave(p1, filename = 'figures/1.png', width = 5, height = 3)
p1.1 = ggseqlogo(as.character(seq_dna$test.seq), method = 'bits') +
theme_bw() +
scale_y_continuous(labels = scales::percent)
p1.1
ggsave(p1.1, filename = 'figures/1.1.png', width = 5, height = 3)
氨基酸序列
# 氨基酸序列
seq_aa = read.table('data/test.AA.seq.txt', header = T)
p2 = ggseqlogo(as.character(seq_aa$.), method = 'prob') +
theme_bw() +
scale_y_continuous(labels = scales::percent)
p2
ggsave(p2, filename = 'figures/2.png', width = 5, height = 3)
自定义数据
ggseqlogo
支持自定义数据,如数字。
# 自定义序列
seq_diy = matrix(ncol = 1, nrow = 10) %>%
as.data.frame()
for (i in 1:nrow(seq_diy)) {
seq.temp = as.character(sample(1:4,10, replace = T))
seq.temp.2 = seq.temp[1]
for (j in 2:10) {
seq.temp.2 = paste(seq.temp.2, seq.temp[j], sep = '')
}
seq_diy[i,] = seq.temp.2
}
colnames(seq_diy) = 'test.seq'
p4 = ggseqlogo(as.character(seq_diy$test.seq),
method = 'prob',
namespace=1:4) +
theme_bw() +
scale_y_continuous(labels = scales::percent)
p4
ggsave(p4, filename = 'figures/4.png', width = 5, height = 3)
矩阵类型数据
另外一种使用得更多的数据应该是类似这样的:
# matrix数据
seq_matrix = read.table('data/test.matrix.txt', header = T) %>%
as.matrix()
p3 = ggseqlogo(seq_matrix, method = 'bits') +
theme_bw()
p3
ggsave(p3, filename = 'figures/3.png', width = 5, height = 3)
更多可视化方法参照作者教程网站:https://omarwagih.github.io/ggseqlogo/。
参考文献
[1] Li, Ying, et al. "Magnaporthe oryzae Auxiliary Activity Protein MoAa91 Functions as Chitin-Binding Protein To Induce Appressorium Formation on Artificial Inductive Surfaces and Suppress Plant Immunity." Mbio 11.2 (2020).
[2] Wagih, Omar. "ggseqlogo: a versatile R package for drawing sequence logos." Bioinformatics 33.22 (2017): 3645-3647.