Author:ligc
Date:19/5/18
DEG differential expression genes
input1 input2
GeneA: input1-reads count input2-reads count
fold change = input2-reads count / input1-reads count
问题1. 测序量是否一致 :测序量,矫正测序量,normalization
1.1 quantile,1.2 genometry, 1.3 traditional
问题2 怎么去衡量,哪个才是真正显著的基因?
200000 -> 250000 ; 2000 -> 2500; 200 -> 250
哪个才是显著差异的?
基本假设:
1. 大部分的gene表达量不变;
2. 高表达的gene表达量不变;
1.为什么大多数基因表达量不变?
p-value
H0: 两者没有差异;
H1:两者有差异;
统计量x,a=0.05,肯定H0或者否定H1;
input1,2000,input2, 2500,p-value = 0.001
p-value < 0.05 , 小概率事件;单次实验不发生;否定H0;有差异
为什么高表达的gene表达量不变?
因为如果高表达基因表达量发生变化则会影响其他基因表达量也发生变化
input1; geneA,B,C 200 = 150 + 25 + 25
input2; geneA,B,C 200 = 130 + 35 + 35
问题3
如果高表达的基因变了怎么办?
如果大多数表达的基因变了怎么办?
3.1 掺入spike-in(针对以上两种情况)
3.2 house keeping gene进行校正(针对高表达基因)
bioinformatics
核心是为了降噪!而不是提高噪音!
we do not need bioinfor magics
cuffdiff pipeline
FASTQ-reads -> mapping (HISAT2,tophat2) -> BAM -> cuffdiff -> DEGs(FPKM)
DESeq2 pipeline
FASTQ-reads -> mapping (HISAT2,tophat2) -> BAM -> Gene Raw Reads Count -> find DEGs in R (No FPKM)
泊松分布
不太常见的事情发生的次数
比如被雷劈
lambda, E,Var
负二项分布
两盒火柴,A,B,随机取光一盒火柴所用的次数X
sum(A,B) >= X >= min(A,B)
类比于:
reads落在gene1上的概率,即reads_count符合negative binomial distribution
gene1, non-gene1
TPM = 1E6 * FPKM / total FPKM
DESeq2
基本流程:
输入数据:
表达矩阵、样品信息矩阵、差异比较矩阵
步骤:
构建一个dds(DESeqDataSet)的对象
利用DESeq函数进行标准化处理
-
用result函数来提取差异比较的结果
构建dds矩阵
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design= ~ batch + condition)
表达矩阵:即countData,就是通过之前的HTSeq-count生成的reads-count计算融合的矩阵。行为基因名,列为样品名,值为reads或fragment的整数。
样品信息矩阵:即代码中的colData,它的类型是一个dataframe(数据框),第一列是样品名称,第二列是样品的处理情况(对照还是处理等)。可以从表达矩阵中导出或是自己单独建立。
差异比较矩阵:即代码中的design,差异比较矩阵就是告诉差异分析函数哪些是对照,哪些是处理。
library(tidyr)
library(DESeq2)
rm(list = ls())
getwd()
list.files()
count_tab = read.table("merge_count.matrix",header =T)# ,row.name =1
count_tab <- separate(count_tab,gene_id,into = "gene_id",sep = "[.]")
rownames(count_tab) = count_tab$gene_id
count_tab = count_tab[,-c(1)]
colData = read.table("SRR_Acc_List_sample.txt",header = T)
colData$condition = factor(colData$condition,c("control","Akap95"))
dds <- DESeqDataSetFromMatrix(countData = count_tab,
colData = colData,
design= ~ condition)
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_Akap95_vs_control")
res = res[order(res$padj), ]
resDF = as.data.frame(res)
resDF$gene_id = rownames(resDF)
resDF = resDF[ ,c(7,1,2,3,4,5,6)]
##cbind
#resDF <- cbind("gene_id"=rownames(resDF),resDF)
DEG_list <- subset(resDF,padj<0.05 &abs(log2FoldChange)>1)
write.table(DEG_list,file = "mouse_DEG_list",row.names = F,sep = "\t",quote = F)
write.table(resDF,file = "mouse_Akap95_DEG.tab",sep = "\t",quote = FALSE,row.names = FALSE)
# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_Akap95_vs_control", type="apeglm")
## PCA plot
library(ggplot2)
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
plot_Fig = ggplot(pcaData, aes(PC1, PC2, color=condition)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
ggsave(plot_Fig,filename = "mouse_DEseq2_PCA.pdf")
# plotPCA(vsd, intgroup=c("condition", "type"))
## MA plot
pdf("mouse_MAplot.pdf")
plotMA(res, ylim=c(-2,2))
dev.off()
mean of normalized counts 表示经过标准化后的平均reads count。
log fold change 基因表达量的倍数变化。
参考文章:
1.https://www.jianshu.com/p/26511d3360c8
2.https://www.jianshu.com/p/5f94ae79f298
3.https://www.bilibili.com/video/av19929350