使用DEseq2对RNA-seq数据进行分析,并计算FPKM和TPM。
该过程使用GenomicFeatures包获取外显子长度,并计算非重叠外显子长度之和作为基因长度。选取这一因素作为基因长度是参考文章https://www.jianshu.com/p/3c21da32d7a4
step1. 安装并加载包:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager",repos = "https://mirrors.tongji.edu.cn/CRAN/") if (!requireNamespace("GenomicFeatures", quietly = TRUE)) BiocManager::install("GenomicFeatures") if (!requireNamespace("DESeq2", quietly = TRUE)) BiocManager::install("DESeq2")
library(GenomicFeatures)
library(DESeq2)
step2. 读取gtf文件,计算基因长度:
txdb <- makeTxDbFromGFF("84kV1_3.gtf", format = "gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
exon_gene_sizes <- sum(width(reduce(ebg)))
step3. 读取DEseq2所需文件:
data_in = read.csv("mycounts1.csv", head=TRUE,row.names =1, check.names = FALSE)
countData = as.data.frame(data_in)
colData = read.csv("mymeta-hj.csv", head=TRUE)
names(colData)=c("id", "condition")
step4. 构建DEseq2数据,并使用fpkm()计算FPKM值
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ condition)
mcols(dds)$basepairs <- exon_gene_sizes
fpkm_out = fpkm(dds)
write.table(as.data.frame(fpkm_out), file="fpkm_out.txt",sep="\t",quote = FALSE)
注:这一步本来使用的rowRanges(dds)<- ebg ,但后续分析过程中发现,这样导入基因长度可能会出现counts和length不匹配的情况,即将一个基因的长度赋值给另一个基因。原因是 ebg <- exonsBy(txdb, by="gene")这一步会按照基因id排序,而countDATA是按照输入counts文件的基因顺序,rowRanges并没有按照基因id将数据合并,而是直接合并。使用mcols(dds)$basepairs的方式可以避免这一情况,将数据按照geneid合并。
step5. FPKM转换成TPM
fpkm<-read.table("fpkm_out.txt", sep = "\t", head=TRUE, row.names = 1, check.names = FALSE)
head(fpkm)
fpkmToTpm <- function(fpkm){
exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}
tpms <- data.frame(apply(fpkm,2,fpkmToTpm), check.names = FALSE)
head(tpms)
colSums(tpms)
write.table(as.data.frame(tpms), file="TPM_out.txt",sep="\t",quote = FALSE)