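After ASEReadCounter has counted allele-specific read coverage at each site, the sites still need to be annotated with gene IDs before running the binomial and Fisher's exact tests. GeneiASE is recommended for this step.
Previous post: ASE (allele-specific expression): ASEReadCounter, Jianshu (jianshu.com)
GeneiASE paper:
https://www.nature.com/articles/srep21134.pdf
A slide deck introducing ASE:
https://scilifelab.github.io/courses/rnaseq/1610/slides/ASE_Olof_Emanuelsson.pdf
1. Download and installation
1.1 Download
https://github.com/edsgard/geneiase/tags
1.2 Installation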
$ tar xvf geneiase-1.0.1.tar.gz
$ cd /your/path/geneiase-1.0.1/bin
geneiase runs on R, so first start R and install the packages it depends on:
$ R
> install.packages(c('getopt', 'binom', 'VGAM'))
> q()
After the packages are installed and you have quit R, geneiase is ready to use.
$ geneiase
Usage: geneiase [-[-ase.type|t] <character>] [-[-in.file|i] <character>] [-[-out.file|o] <character>] [-[-betabin.p|p] <double>] [-[-betabin.rho|r] <double>] [-[-n.bootstrap.samples|b] <integer>] [-[-min.feat.vars|m] <integer>] [-[-nmax.vars|x] <integer>] [-[-lib.file|l] <character>] [-[-help|h]]
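If the usage message appears, the installation succeeded.
2. Parameters
geneiase requires only two arguments, -t and -i:
-t,
"static" or "icd",
selects the type of ASE to test: static ASE ("static") or individual condition-dependent ASE ("icd")
-i,
name of the input file
The test folder of the unpacked archive contains example data for both types.
The static data has four columns: gene ID (featureID), SNP ID, alternative allele count, and reference allele count. Example: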
$ less static.test.input.tab
gene snp.id alt.dp ref.dp
10.9 1 4 6
10.9 2 6 4
10.9 3 5 5
10.9 4 0 10
10.9 5 9 1
10.9 6 5 5
10.9 7 3 7
10.9 8 8 2
10.9 9 7 3
101.2 10 6 4
101.2 11 5 5
103.3 12 4 6
103.3 13 9 1
103.3 14 1 9
105.5 15 5 5
105.5 16 0 10
105.5 17 7 3
The icd data has six columns: gene ID, SNP ID, untreated alternative allele count, untreated reference allele count, treated alternative allele count, and treated reference allele count. Example:
$ less icd.test.input.tab
gene snp.id U.alt.dp U.ref.dp T.alt.dp T.ref.dp
1.11 1 8 2 7 3
1.11 2 3 7 4 6
1.11 3 8 2 6 4
1.11 4 5 5 7 3
1.11 5 6 4 1 9
1.11 6 9 1 5 5
1.11 7 4 6 5 5
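Before running GeneiASE on your own data, a quick format check against the two layouts above can save a failed run. The following is a minimal sketch, assuming the input file is tab-delimited with a single header line (as the `.tab` test files suggest); `check_geneiase_input` is a hypothetical helper name, not something shipped with GeneiASE.

```shell
# check_geneiase_input: hypothetical helper, not part of GeneiASE.
# Checks a tab-delimited input file ($1) with one header line:
#   - every line has the expected number of columns ($2: 4 for static, 6 for icd)
#   - every column after the first two holds a non-negative integer count
# Prints each problem found and returns non-zero if there was any.
check_geneiase_input() {
    awk -F'\t' -v n="$2" '
        NF != n {
            printf "line %d: expected %d columns, found %d\n", NR, n, NF
            bad = 1
            next
        }
        NR > 1 {
            for (i = 3; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: column %d is not a count: %s\n", NR, i, $i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (file names follow the test folder):
# check_geneiase_input static.test.input.tab 4
# check_geneiase_input icd.test.input.tab 6
```

The function only checks the shape of the file; it says nothing about whether the counts themselves are sensible.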
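3. ASE testing
After ASEReadCounter has finished counting coverage at each site, extract the chromosome and position of every site from its output and arrange them in a table with the following format:
$ less LPF1_MP.pos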
Mpar_chr1 2001 2001
Mpar_chr1 2015 2015
Mpar_chr1 2034 2034
Mpar_chr1 2037 2037
Mpar_chr1 2206 2206
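The extraction step described above can be sketched in shell as follows. This assumes the ASEReadCounter table has one header line and that `contig` and `position` are its first two columns, as in the `head(ASE)` output in section 3.2; `make_pos` is a hypothetical helper name.

```shell
# make_pos: hypothetical helper, not part of GATK or GeneiASE.
# Reads an ASEReadCounter table ($1) with one header line, contig in
# column 1 and position in column 2, and prints
# "contig<TAB>position<TAB>position" for every data row.
make_pos() {
    awk -v OFS='\t' 'NR > 1 { print $1, $2, $2 }' "$1"
}

# Usage (file names follow the article):
# make_pos LPF1_MP_ASE.table > LPF1_MP.pos
```

The sketch keeps the article's layout, with the position used as both start and end, so that the `snp_id` built later from V1 and V2 still matches. Note that bedtools reads a three-column file as BED, where the start coordinate is 0-based; if you prefer strict BED intervals (start = position - 1), the `snp_id` in section 3.2 would have to be built from V1 and V3 instead.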
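3.1 Finding the gene for each site
For how to use bedtools, this article gives a detailed introduction:
The most complete Bedtools guide: this article is all you need, Jianshu (jianshu.com)
First sort the position file and the genome annotation (GFF3) file. Note that the chromosome names in the pos file and the GFF file must be identical: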
$ bedtools sort -chrThenSizeA -i LPF1_MP.pos > LPF1_MP_sort.pos
$ bedtools sort -chrThenSizeA -i Mparg_v2.0.gff3 > Mparg_v2.0_sort.gff3
For each SNP site in the pos file, report the annotation features it overlaps on the genome:
$ bedtools intersect -a LPF1_MP_sort.pos -b Mparg_v2.0_sort.gff3 -wb > LPF1_MP_gene.pos
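If the intersect output is empty or much smaller than expected, mismatched chromosome names are one likely cause, since the names in the two files must be identical. A minimal check, assuming both files are tab-delimited with the chromosome name in column 1; `missing_chroms` is a hypothetical helper name.

```shell
# missing_chroms: hypothetical helper, not part of bedtools.
# Prints each chromosome name that occurs in the position file ($1)
# but never in column 1 of the GFF3 file ($2). GFF3 comment and
# directive lines (starting with '#') and blank lines are ignored.
missing_chroms() {
    awk -F'\t' '
        NR == FNR {                       # first file read: the GFF3
            if ($0 !~ /^#/ && NF) seen[$1] = 1
            next
        }
        NF && !($1 in seen) && !($1 in reported) {
            reported[$1] = 1
            print $1
        }
    ' "$2" "$1"
}

# Usage (file names follow the article):
# missing_chroms LPF1_MP_sort.pos Mparg_v2.0_sort.gff3
```

No output means every chromosome name in the position file also appears in the annotation.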
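3.2 Adding gene information in R
Add the gene information obtained in the previous step to the per-site coverage count table produced by ASEReadCounter.
For the ASEReadCounter output, see the previous post: ASE (allele-specific expression): ASEReadCounter, Jianshu (jianshu.com)
# Read LPF1_MP_ASE.table and LPF1_MP_gene.pos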
> ASE<-read.table("LPF1_MP_ASE.table",header = T)
> gene<-read.table("LPF1_MP_gene.pos")
Create snp_id by joining the contig and position columns of LPF1_MP_ASE.table, and likewise the V1 and V2 columns of LPF1_MP_gene.pos.
> ASE <- tidyr::unite(ASE, "snp_id", contig, position,remove = FALSE)
> head(ASE)
snp_id contig position variantID refAllele altAllele refCount altCount totalCount
1 Mpar_chr1_4724 Mpar_chr1 4724 . A C 47 39 86
2 Mpar_chr1_4881 Mpar_chr1 4881 . C G 52 33 85
3 Mpar_chr1_4900 Mpar_chr1 4900 . T C 46 31 77
4 Mpar_chr1_4962 Mpar_chr1 4962 . T C 49 34 83
5 Mpar_chr1_4995 Mpar_chr1 4995 . G T 45 44 89
lowMAPQDepth lowBaseQDepth rawDepth otherBases improperPairs
1 0 0 88 0 2
2 0 0 86 1 0
3 0 0 77 0 0
4 0 0 83 0 0
5 0 0 89 0 0
> gene <- tidyr::unite(gene, "snp_id", V1, V2,remove = FALSE)
> head(gene)
snp_id V1 V2 V3 V4 V5 V6 V7 V8
1 Mpar_chr1_29618717 Mpar_chr1 29618717 29618717 Mpar_chr1 AUGUSTUS intron 29618269 29618971
2 Mpar_chr1_29618717 Mpar_chr1 29618717 29618717 Mpar_chr1 AUGUSTUS gene 29617909 29621312
3 Mpar_chr1_29618717 Mpar_chr1 29618717 29618717 Mpar_chr1 AUGUSTUS transcript 29617909 29621312
4 Mpar_chr1_29511536 Mpar_chr1 29511536 29511536 Mpar_chr1 AUGUSTUS CDS 29511235 29511554
5 Mpar_chr1_29511536 Mpar_chr1 29511536 29511536 Mpar_chr1 AUGUSTUS exon 29511235 29511554
V9 V10 V11 V12
1 1 - . Parent=MP1G214900.1
2 1 - . ID=MP1G214900
3 1 - . ID=MP1G214900.1
4 1 - 0 Parent=MP1G214200.1
5 . - . Parent=MP1G214200.1
Keep only the CDS records from the annotation; ASE measured within CDS regions is more reliable:
> gene<-subset(gene,V6=='CDS')
Match on snp_id and add the gene ID to the ASE table:
> merga<-merge(ASE,gene, by = "snp_id", all.x = TRUE)
> write.csv(merga,"LPF1_MP_merga.csv",row.names = F)
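3.3 Preparing the input file
Taking static data as the example, four columns are needed. Extract them from LPF1_MP_merga.csv (column 26 is V12, the annotation attribute field that holds the gene ID):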
> raw<-read.csv("LPF1_MP_merga.csv")
> GeneiASE_input<-raw[,c(26,1,8,7)]
> head(GeneiASE_input)
V12 snp_id altCount refCount
1 <NA> Mpar_c2518_pilon_116563 1 395
2 <NA> Mpar_c2518_pilon_132171 3 3
3 <NA> Mpar_c2518_pilon_133271 1 1
4 <NA> Mpar_c2518_pilon_153461 2 5
5 <NA> Mpar_c2518_pilon_155680 2 4
Remove the rows with a missing gene ID:
> GeneiASE_input <- na.omit(GeneiASE_input)
> names(GeneiASE_input)[1] <-"gene_id"
> head(GeneiASE_input)
gene_id snp_id altCount refCount
72 Parent=MP1G130700.1 Mpar_chr1_10006428 14 19
73 Parent=MP1G130700.1 Mpar_chr1_10006455 14 17
87 Parent=MP1G130700.1 Mpar_chr1_10006863 27 24
88 Parent=MP1G130700.1 Mpar_chr1_10006921 23 18
89 Parent=MP1G130700.1 Mpar_chr1_10006970 23 18
Write the file:
> write.table(GeneiASE_input,"LPF1_MP_GeneiASE_input.tab",quote = FALSE,row.names = FALSE,col.names = T,sep ='\t')
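GeneiASE aggregates SNP-level evidence per gene, and its usage message lists a `--min.feat.vars` option, so it can be useful to see how many rows each gene contributes before running the test. A minimal sketch, assuming the tab-delimited file written above with one header line; `vars_per_gene` is a hypothetical helper name.

```shell
# vars_per_gene: hypothetical helper, not part of GeneiASE.
# Counts the data rows for each value of column 1 in a tab-delimited
# file ($1) with one header line, and prints "gene_id<TAB>count"
# sorted by gene_id.
vars_per_gene() {
    awk -F'\t' 'NR > 1 { n[$1]++ } END { for (g in n) print g "\t" n[g] }' "$1" |
        LC_ALL=C sort
}

# Usage (file name follows the article):
# vars_per_gene LPF1_MP_GeneiASE_input.tab | less
```

Because the gene_id here is the `Parent=` transcript identifier taken from the CDS records, a SNP that lies in the CDS of several transcripts is counted once per transcript.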
3.4 ASE test
$ cd /your/path/geneiase-1.0.1/bin
$ geneiase -t static -i LPF1_MP_GeneiASE_input.tab -b 100
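- -b (n.bootstrap.samples): the number of bootstrap samples (B) used to generate the null distribution. Default: 1e5. The example uses -b 100 to keep the run short; a value this small probably limits the resolution of the p-values, so consider a larger value for a final analysis.
The result file contains the following columns:
- feat: gene ID
- n.vars: number of variants in the gene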
- mean.s: Mean of s across the variants within the gene
- median.s: Median of s across the variants within the gene
- sd.s: Standard deviation of s across the variants within the gene
- cv.s: Coefficient of variation of s across the variants within the gene
- liptak.s: Stouffer-Liptak combination of s
- p.nom: Nominal p-value
- fdr: Benjamini-Hochberg corrected p-value
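Once the result file exists, the genes that pass a significance cutoff can be pulled out directly in the shell. A minimal sketch, assuming the result file is tab-delimited with an unquoted header line that uses the column names listed above; the 0.05 cutoff and the helper name `sig_genes` are assumptions for illustration.

```shell
# sig_genes: hypothetical helper, not part of GeneiASE.
# Prints the header line plus every row of a tab-delimited result
# file ($1) whose "fdr" value is at or below a cutoff ($2, default 0.05).
# The fdr column is located by its header name; rows with NA are skipped.
sig_genes() {
    awk -F'\t' -v max="${2:-0.05}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "fdr") col = i
            print
            next
        }
        col && $col != "NA" && $col + 0 <= max + 0
    ' "$1"
}

# Usage (file name follows the article):
# sig_genes LPF1_MP_GeneiASE_input.tab.static.gene.pval.tab 0.05
```

If no column named `fdr` is found, only the header line is printed.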
3.5 Collecting the ASE test results
> p_value<-read.csv("LPF1_MP_GeneiASE_input.tab.static.gene.pval.tab",sep ='\t')
> names(p_value)[1] <-"gene_id"
> names(raw)[26] <-"gene_id"
> result <- merge(p_value,raw, by = "gene_id", all.x = TRUE)
> result <- result[,c(10:12,14:17,1:9)]
> write.csv(result,"LPF1_MP_result.csv",row.names = F)
Please credit the source when citing or reposting. Corrections are welcome.