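In summary, six R/Bioconductor packages do most of the work when handling genomic range information:
IRanges, GenomicRanges: as the names suggest, these handle genomic interval information. [GenomicRanges combines chromosome coordinate ranges, sequence names, and strand information. Many Bioconductor packages depend heavily on the low-level data structures that IRanges and GenomicRanges provide.]
GenomicFeatures: handles gene models and other sequence features (gene features) on the genome, such as genes, exons, UTRs, and transcripts.
A Gene Model is defined as any description of a gene product from a variety of sources including computational prediction, mRNA sequencing, or genetic characterization.
The gene feature is meant to approximately cover the region of nucleic acid considered by workers in the field to be the gene.
Biostrings, BSgenome: handle sequences, for example extracting subsequences. [Biostrings contains basic sequence analysis functions such as sequence alignment and pattern matching; BSgenome operates on annotated whole-genome data.]
rtracklayer: reads files such as BED, GTF, and WIG. [rtracklayer can also send data to the UCSC Genome Browser (http://genome-asia.ucsc.edu/) for viewing, manipulation, and export.]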
Before diving into working with genomic ranges, we’re going to get our feet wet with generic ranges (i.e., ranges that represent a contiguous subsequence of elements over any type of sequence)
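Start with simple ranges and scale up step by step, just as you would when building a batch pipeline, to gradually develop "range thinking".
# Example: load IRanges, then create a range from 4 to 13
> library(IRanges)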
> rng <- IRanges(start = 4,end = 13)
> rng
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 13 10
# Given a start position and a width
> IRanges(start=4, width=3)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 6 3
# Given an end position and a width
> IRanges(end=5, width=5)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 5 5
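The three constructor calls above all rely on the same relation: ranges are closed intervals, so width = end - start + 1, and any two of the three values fix the third. A minimal sketch of that relation follows; the variable names are made up for illustration.

```r
library(IRanges)

# Any two of start, end, and width determine the third,
# because both endpoints are included: width = end - start + 1.
by_start_end   <- IRanges(start = 4, end = 13)
by_start_width <- IRanges(start = 4, width = 10)
by_end_width   <- IRanges(end = 13, width = 10)

# All three calls describe the same closed interval [4, 13].
c(start(by_start_width), end(by_start_width))  # 4 13
c(start(by_end_width), end(by_end_width))      # 4 13
width(by_start_end) == end(by_start_end) - start(by_start_end) + 1  # TRUE
```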
> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 13 10
[2] 7 7 1
[3] 2 5 4
[4] 20 23 4
> names(x) <- paste0("gene",1:4)
> x
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
gene1 4 13 10
gene2 7 7 1
gene3 2 5 4
gene4 20 23 4
> str(x)
Formal class 'IRanges' [package "IRanges"] with 6 slots
..@ start : int [1:4] 4 7 2 20
..@ width : int [1:4] 10 1 4 4
..@ NAMES : chr [1:4] "gene1" "gene2" "gene3" "gene4"
..@ elementType : chr "ANY"
..@ elementMetadata: NULL
..@ metadata : list()
# Get the start positions
> start(x)
[1] 4 7 2 20
# Get the end positions
> end(x)
[1] 13 7 5 23
# Get the widths of the ranges
> width(x)
[1] 10 1 4 4
# Move every end position 4 to the right
> end(x) <- end(x) + 4
> x
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
gene1 4 17 14
gene2 7 11 5
gene3 2 9 8
gene4 20 27 8
# range() returns the single range that spans all of them
> range(x)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 2 27 26
# Select the ranges whose start position is less than 5
> x[start(x) < 5]
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
gene1 4 17 14
gene3 2 9 8
# Select the genes whose width is greater than 8
> x[width(x) > 8]
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
gene1 4 17 14
# Look up the range for gene3 by name
> x['gene3']
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
gene3 2 9 8
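The subsetting calls above use ordinary R indexing, so conditions can be combined and several names can be requested at once. The sketch below rebuilds the same `x` (after the ends were extended by 4) so it runs on its own; `long_early` and `picked` are illustrative names.

```r
library(IRanges)

# Rebuild x as it stands after end(x) <- end(x) + 4.
x <- IRanges(start = c(4, 7, 2, 20), end = c(17, 11, 9, 27))
names(x) <- paste0("gene", 1:4)

# Combine two logical conditions: starts before 5 AND at least 8 wide.
long_early <- x[start(x) < 5 & width(x) >= 8]
names(long_early)  # "gene1" "gene3"

# Index by several names at once; the order of the request is kept.
picked <- x[c("gene4", "gene1")]
start(picked)  # 20 4
```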
# Ranges can also be concatenated easily with c() (the ranges are appended as-is, not merged)
> a <- IRanges(start=7, width=4)
> b <- IRanges(start=2, end=5)
> c(a,b)
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 7 10 4
[2] 2 5 4
# Suppose the current coding regions look like this
> x <- IRanges(start=c(40, 80), end=c(67, 114))
> x
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 40 67 28
[2] 80 114 35
# Extend both the upstream and downstream sides by 4 bp
> x+4L
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 36 71 36
[2] 76 118 43
# Note the change: each start moved 4 bp to the left and each end moved 4 bp to the right, so each width grew by 8
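To make the arithmetic explicit, here is a small sketch that rebuilds the widened ranges by hand and compares them with `x + 4L`, then shows that subtraction trims both sides instead. The names `widened_by_hand`, `widened`, and `narrowed` are illustrative only.

```r
library(IRanges)

x <- IRanges(start = c(40, 80), end = c(67, 114))

# Manual equivalent of x + 4L: move each start 4 left and each end 4 right.
widened_by_hand <- IRanges(start = start(x) - 4L, end = end(x) + 4L)
widened <- x + 4L
all(start(widened) == start(widened_by_hand))  # TRUE
all(end(widened) == end(widened_by_hand))      # TRUE

# Subtraction trims both ends instead.
narrowed <- x - 4L
start(narrowed)  # 44 84
end(narrowed)    # 63 110
```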
# Original ranges
> y <- IRanges(start=c(4, 6, 10, 12), width=13)
> y
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 16 13
[2] 6 18 13
[3] 10 22 13
[4] 12 24 13
# Now keep only the part of each gene in y that falls within positions 5-13. Each gene has its own start and end: a gene that starts before 5 is clipped to start at 5, a gene that starts after 5 keeps its own start, and the ends are handled the same way against 13.
> restrict(y,5,13)
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 5 13 9
[2] 6 13 8
[3] 10 13 4
[4] 12 13 2
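For ranges that overlap the window, as all four do here, the clipping rule described above amounts to taking the larger of each start and 5, and the smaller of each end and 13. The sketch below checks that reading against `restrict()`; the `pmax`/`pmin` version is only an illustration of the rule, not a claim about how IRanges implements it.

```r
library(IRanges)

y <- IRanges(start = c(4, 6, 10, 12), width = 13)

# Clip every range to the window [5, 13].
clipped <- restrict(y, 5, 13)

# Hand-rolled version of the same rule, valid when each range overlaps the window:
# start becomes max(start, 5), end becomes min(end, 13).
clipped_by_hand <- IRanges(start = pmax(start(y), 5L), end = pmin(end(y), 13L))

all(start(clipped) == start(clipped_by_hand))  # TRUE
all(end(clipped) == end(clipped_by_hand))      # TRUE
width(clipped)  # 9 8 4 2
```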
Getting the flanking regions (flank): for example, the region to the left of the transcription start site (TSS), or the region to the right of the transcription termination site (TTS).
> x
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 40 67 28
[2] 80 114 35
# Get the 7 bp immediately to the left of each start (by default flank() takes the upstream side)
> flank(x,width = 7)
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 33 39 7
[2] 73 79 7
# To get the downstream flank after each end instead, set start=FALSE
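The downstream case mentioned in the last comment can be sketched as follows, using the same `x`. With `start = FALSE`, the flank begins one position after each end; `upstream` and `downstream` are illustrative names.

```r
library(IRanges)

x <- IRanges(start = c(40, 80), end = c(67, 114))

# Default: the 7 positions just before each start.
upstream <- flank(x, width = 7)
start(upstream)  # 33 73
end(upstream)    # 39 79

# start = FALSE: the 7 positions just after each end.
downstream <- flank(x, width = 7, start = FALSE)
start(downstream)  # 68 115
end(downstream)    # 74 121
```

Plain IRanges objects carry no strand, so "upstream" here simply means the lower-coordinate side.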
The reduce() function merges overlapping ranges into single ranges.
# Simulated data (20 randomly drawn start positions, each range 5 wide)
> set.seed(12)
> alns <- IRanges(start=sample(seq_len(50), 20), width=5)
> head(alns, 4)
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 8 5
[2] 41 45 5
[3] 46 50 5
[4] 13 17 5
# Merge the overlapping ranges; the total coverage ends up as 1-26 and 28-54
> reduce(alns)
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 26 26
[2] 28 54 27
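The gaps() function returns the gaps between ranges (a gap runs from just after the end of one merged region to just before the start of the next).
> gaps(alns)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 27 27 1
# The reduction above produced only two large regions (think of them as two genes), so the intergenic region between them is the single position 27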
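Because the simulated data above depends on `sample()`, whose results can differ between R versions, a small hand-written example may be easier to check by eye. The ranges in `reads` are made up for illustration.

```r
library(IRanges)

# Four hypothetical reads: the first two overlap, the last two stand alone.
reads <- IRanges(start = c(1, 4, 10, 20), end = c(5, 8, 12, 25))

# reduce() collapses overlapping ranges into one range each.
merged <- reduce(reads)
start(merged)  # 1 10 20
end(merged)    # 8 12 25

# gaps() reports the uncovered stretches between the merged regions.
uncovered <- gaps(reads)
start(uncovered)  # 9 13
end(uncovered)    # 9 19
```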
> a <- IRanges(start=4, end=13)
> b <- IRanges(start=12, end=17)
# Intersection of a and b
> intersect(a, b)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 12 13 2
# Set difference: the part of a that is not in b; swap the arguments to get the part of b that is not in a
> setdiff(a,b)
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4 11 8
# The union is given by union()
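The two operations mentioned but not run above, the swapped set difference and the union, can be sketched with the same `a` and `b`. The result names are illustrative.

```r
library(IRanges)

a <- IRanges(start = 4, end = 13)
b <- IRanges(start = 12, end = 17)

# Positions in b that are not in a.
only_in_b <- setdiff(b, a)
c(start(only_in_b), end(only_in_b))  # 14 17

# Positions covered by a, by b, or by both.
either <- union(a, b)
c(start(either), end(either))  # 4 17
```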