started 12/05/2020
rna-seq公司测出来的真是玄学数据,或许是重复测得太少了,反正测出来重复率都不是很好,这次重新分析,去掉两个不太好的重复之后再进行定量分析,记录一下流程。
从read比对结果中获取定量数据
现在我手头已经有公司质控过后的raw data,并且也已经通过hisat2和samtools拿到了bam文件,想进一步进行定量分析,首先需要对基因(转录本)的表达进行定量,我觉得这样说得很合适(出自subread的user guide)
Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome. This process is often called read summarization or read quantification. Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.
本次使用的是Subread下的featureCounts软件,其对自己的描述是这样的:
featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. It can be used to count both RNA-seq and genomic DNA-seq reads.
就是一个帮我们数read的功能,当然它数read是有它自己的思路的,这里建议manual找到第六大节自己看:
http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
然后一行命令:
featureCounts -a genome.gtf -o counts.txt WT_r1.bam WT_r2.bam S1_r2.bam S1_r3.bam
获得reads统计的结果
通过DEseq2来进行差异表达分析
因为之前r出了问题,我把所有的包还有r都删除重装了,现在要从bioconductor重新装
最新的bioconductor装包的流程已经简单了很多
install.packages("BiocManager")
BiocManager::install("DESeq2")
无数的依赖包一起倾泻在了我的小mac上,然后安装完毕
之前用得都比较随意,没有去好好看过DESeq2的计算原理,现在开始慢慢地啃mannul
mannul看着是相当的痛苦,主要是功能太复杂,而我的需求和脑子太简单,先把最简单的流程过出来差异基因拿到手再说,代码如下
library(DESeq2)
visetwd("./deseq2")
#注意,此处输入的counts.txt是由featurecounts的输出结果删除第一行,再删除左上角的geneid栏得到的
count_data<-read.table("counts.txt",header = TRUE)
count_data<-count_data[,6:9]
condition<-factor(c("WT","WT","S1","S1"))
dds <- DESeqDataSetFromMatrix(count_data,DataFrame(condition),design = ~ condition)
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
dds <- DESeq(dds)
res <- results(dds)
summary(res)
write.table(res,file="All_results.txt",sep = "\t")
(果然之后也没有再去看模型就开始往下走了,我感觉师姐毕业有一丝危险)
差异表达基因的GO富集分析
非常简单,即将自己的差异表达基因列表丢到agrigo的在线分析网站里即可
#注意使用agri2.0不要跑到1.0的网站去了
#注意自己输入的基因的格式(基因组不同等),不同的格式注释情况不一样跑出来的go结果还差别蛮大的,如果是在水稻里做,可以用我之前自己制作的非常粗糙的两个基因组对照的dictionary,百度云链接放在下面:
拿到了GO的富集表,就可以自己画图然后无中生有一两个results了。。唉。。。结果真差啊。。。想哭