A simple example
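# Loading the genome package also attaches Biostrings and GenomicRanges
library("BSgenome.Hsapiens.UCSC.hg19")
# Toy sequence (named seq_str rather than str, which is also a base R function)
seq_str = "ACGCCTCGAGCGTAGCGTAGCGT"
matchPattern("CG", seq_str)
cat("\n")
# Start position of every "CG" match; each site spans two bases
POS = start(matchPattern("CG", seq_str))
chr = rep("chr1", length(POS))
start = POS
end = POS + 1
strand = rep("+", length(POS))
# Collect one GRanges per chromosome in grList
grList = list()
GRx <- GRanges(seqnames = Rle(chr),
               ranges = IRanges(start, end),
               strand = Rle(strand))
grList[[1]] <- GRx
grList[[1]]
# Merge the per-chromosome GRanges into one object and save it
cgGR = unlist(GRangesList(grList))
save(cgGR, file = "hg19_CpG_sites.RData")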
The output is as follows:
  Views on a 23-letter BString subject
subject: ACGCCTCGAGCGTAGCGTAGCGT
views:
    start end width
[1]     2   3     2 [CG]
[2]     7   8     2 [CG]
[3]    11  12     2 [CG]
[4]    16  17     2 [CG]
[5]    21  22     2 [CG]
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       2-3      +
  [2]     chr1       7-8      +
  [3]     chr1     11-12      +
  [4]     chr1     16-17      +
  [5]     chr1     21-22      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
Looping over the chromosomes in the same way yields an RData file that holds the CpG sites of the whole genome.
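A minimal sketch of that loop follows. It assumes that only the standard chromosomes (chr1 to chr22, chrX and chrY) are wanted, which the original does not specify. The helper name `cpg_sites` is hypothetical. As in the toy example, only matches on the forward strand are recorded.

```r
library(BSgenome.Hsapiens.UCSC.hg19)

genome <- BSgenome.Hsapiens.UCSC.hg19

# Assumption: restrict the scan to the standard chromosomes
chroms <- paste0("chr", c(1:22, "X", "Y"))

# Hypothetical helper: all "CG" positions of one chromosome as a GRanges
cpg_sites <- function(chr) {
  # genome[[chr]] loads the chromosome sequence as a DNAString
  hits <- matchPattern("CG", genome[[chr]])
  POS <- start(hits)
  GRanges(seqnames = Rle(rep(chr, length(POS))),
          ranges = IRanges(POS, POS + 1),
          strand = Rle(rep("+", length(POS))))
}

# One GRanges per chromosome, merged into a single object and saved
grList <- lapply(chroms, cpg_sites)
cgGR <- unlist(GRangesList(grList))
save(cgGR, file = "hg19_CpG_sites.RData")
```

Each chromosome is loaded into memory only while it is being scanned, so the peak memory use is driven by the largest chromosome plus the accumulated ranges.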
Biostrings
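The BSgenome.Hsapiens.UCSC.hg19 package is built on top of the IRanges, GenomeInfoDb, GenomicRanges, Biostrings and XVector packages. Biostrings is one of its base packages, and its matchPattern() function finds the start and end positions of a given pattern within a target string. Combined with the genome data in BSgenome.Hsapiens.UCSC.hg19, it can be used to build datasets such as CpG site data, or to save the positions of any other fixed sequence.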
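The same approach carries over to other fixed sequences. The sketch below is illustrative only: the helper `motif_sites`, the toy sequence and the motif `GAATTC` are assumptions that do not come from the original, and only exact matches on the forward strand are reported.

```r
library(Biostrings)
library(GenomicRanges)

# Hypothetical helper: every exact occurrence of `motif` in `seq` as a GRanges
motif_sites <- function(motif, seq, chr) {
  hits <- matchPattern(motif, seq)
  GRanges(seqnames = rep(chr, length(hits)),
          ranges = IRanges(start = start(hits), end = end(hits)),
          strand = rep("+", length(hits)))
}

# Toy DNA sequence containing the motif twice (positions 3-8 and 12-17)
toy <- DNAString("TTGAATTCAAGGAATTCC")
sites <- motif_sites("GAATTC", toy, "chr1")
sites
```

For a real chromosome, `toy` could be replaced by a sequence taken from the genome package, for example `BSgenome.Hsapiens.UCSC.hg19[["chr1"]]`.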