转录组WGCNA分析

介绍这个包之前，先要搞清楚这个包能干啥。（部分内容摘抄自学术咖）

Q1：WGCNA能干嘛？
A1：能够将表达模式相似的基因进行聚类，并分析模块与特定性状或表型之间的关联关系。具体一点：1）构建分层聚类树(hierarchical clustering tree)，聚类树的不同分支代表不同的基因模块(module)，模块内基因共表达程度高，而分属不同模块的基因共表达程度低。2）探索模块与特定表型或疾病的关联关系，最终达到鉴定疾病治疗的靶点基因、基因网络的目的。

Q2：WGCNA分析结果中总是提到共表达网络，是什么？
A2：共表达网络特指利用基因间的表达相关性预测基因间调控关系的方法，WGCNA是共表达网络分析中最有效的方法之一。

Q3：一般说WGCNA的样品不少于15个，15个样品考虑重复吗？
A3：15个样本这个是包含了生物学重复，比如5个时间点3个重复。

Q4：每个样本有3个生物学重复，不需要对三个重复的表达量求平均值代表该样本吗？
A4：做WGCNA的时候每个样本是独立的，三个生物学重复样本是全部导入做分析，不是取均值再做分析，每个样本都是独立的。

Q5：WGCNA里面一般会提到hubgene，如何确定hubgene?
A5：在WGCNA分析里面，每个基因都会计算连通性，连通性高的就是hubgene。

那么根据它能做的事情，再结合具体的数据，那么我们在做WGCNA之前需要准备的数据有两个：表达量数据和表型数据。
表达量数据，FPKM矩阵即可。
表型数据，即性状数据，比如肿瘤的stage、肿瘤的预后等等。可以是质量性状也可以是数量性状。

1、安装包
你可以直接安装，但是后面会报错。

install.packages("WGCNA")

看了半天发现，是少了一个impute的包。所以需要重新安装。

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("impute")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GO.db")
BiocManager::install("impute")
BiocManager::install("preprocessCore")
install.packages("WGCNA")

2、导入数据

library(impute)
library(WGCNA)
setwd("E:/8资料/2.干细胞/2.WGCNA")
##导入表达量数据
fpkm <- read.csv("DEseq2_diffExpress_fpkm.csv", header=TRUE)
rownames(fpkm) <- fpkm[,1]
fpkm<- fpkm[,-1]
##筛选差异最显著的top500个基因
WGCNA_matrix <- t(fpkm[order(apply(fpkm,1,mad), decreasing = T)[1:500],])
datExpr <- WGCNA_matrix #这个选择top多少，根据你自己的数据确定，25%之类的是可以的。
## 导入表型数据
datTraits <- read.csv("LIHC_RNAi.CSV",header=TRUE)
rownames(datTraits) <- datTraits[,1]
datTraits <- datTraits[,-1]

3、用hclust给所有的样本建树。看看不同个体之间的距离，以及有没有一些具体特别远的个体。

## 查看是否有离群样品(结果辣眼睛)
sampleTree = hclust(dist(datExpr), method = "average")
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="")

4、确定最佳的beta值，

## 确定最佳beta值
powers = c(c(1:10), seq(from = 12, to=30, by=2)) #建立一个numeric，前面是1:10，后面是12:30，间隔为2。

#pickSoftThreshold这个函数的英文解释是Analysis of scale free topology for multiple soft thresholding powers. The aim is to help the user pick an appropriate soft-thresholding power for network construction
#其实就是选一个软阈值，用于构建拓扑结构。（嗯~阈值软硬是咋回事呢？我也不是很清楚。）
sft = pickSoftThreshold(datExpr, powerVector = powers,verbose = 5)
View(sft)

image.png

画图

# Plot the results
par(mfrow = c(1,2)) #实现一页多图，1行2列。
cex1 = 0.9 #设置图中数字字体的大小，也可以后面直接设置cex即可。
# sft$fitIndices[,1]表示powers，-sign(sft$fitIndices[,3])*sft$fitIndices[,2]虽然知道每一项是什么，但是不知道为什么这么计算。
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
       main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       labels=powers, cex=cex1, col="red")
# this line corresponds to using an R^2 cut-off of h，加上cutoff的线，这里是0.9（网上的资料都是0.9，应该试试其他的参数，看看有什么不同。）
abline(h=0.90,col="red")  #得到了结果，发现能够达到0.9的最小的软阈值是3，所以后续软阈值应该就是3.

# Mean connectivity as a function of the soft-thresholding power，还有一个connectivity的结果，但是我还不太清楚正常的结果和异常的结果之间到底有什么区别，先这样吧。
plot(sft$fitIndices[,1], sft$fitIndices[,5],
       xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
       main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

5（5.1）、构建共表达矩阵（自动构建网络 + 模块识别）

## 有了表达矩阵和估计好的最佳beta值，就可以直接构建共表达矩阵。 
#blockwiseModules：This function performs automatic network construction and module detection on large expression datasets in a block-wise manner.
#参数networkType 有"unsigned"、"signed"等选择，"signed"与"unsigned"方法的不同会导致阈值和网络的不同，建议选择"signed"。不过即使该步与之后TOMsimilarityFromExpr()函数均选择"signed"，最终输出用于Cytoscape可视化的网络时仍显示"undirected"。
net = blockwiseModules(
                 datExpr,
                 power = sft$powerEstimate, #这就是刚才确定的软阈值。
                 maxBlockSize = 6000, #貌似这个值不能小于你datExpr当中基因的数目。
                 TOMType = "signed", minModuleSize = 30,
                 reassignThreshold = 0, mergeCutHeight = 0.25,
                 numericLabels = TRUE, pamRespectsDendro = FALSE,
                 saveTOMs = TRUE, 
                 verbose = 3
 )
table(net$colors) 
net$colors

image.png

可视化module

# labels2colors：Converts a vector or array of numerical labels into a corresponding vector or array of colors corresponding to the labels. 将每个基因对应的数值转换成颜色。
mergedColors = labels2colors(net$colors)
table(mergedColors)

# plotDendroAndColors ：This function plots a hierarchical clustering dendrogram and color annotation(s) of objects in the dendrogram underneath. 就是画图。
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03, 
                    addGuide = TRUE, guideHang = 0.05)

5（5.2）、构建共表达矩阵（逐渐构建网络 + 模块识别）
Step_1:Co-expression similarity and adjacency

softPower = 3
adjacency = adjacency(datExpr, power = softPower)

Step_2:计算拓扑重叠矩阵（TOM）

# Turn adjacency into topological overlap
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM

Step_3:使用TOM(拓扑重叠矩阵)进行聚类，绘制聚类得到的树形图。

# Call the hierarchical clustering function 
geneTree = hclust(as.dist(dissTOM), method = "average")
# Plot the resulting clustering tree (dendrogram) 
sizeGrWindow(12,9) 
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04)

Step_4:使用dynamic tree cut来识别模块。

#需要事前设置好最小的模块允许包含的变量个数。因为，我们喜欢大的模块，此处设置模块的最小值为30.使用cutreeDynamic()函数对树形图进行切割。
# We like large modules, so we set the minimum module size relatively high: 
minModuleSize = 30
# Module identification using **dynamic tree cut**: 
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM, deepSplit = 2, pamRespectsDendro = FALSE, minClusterSize = minModuleSize)
table(dynamicMods)

#在树形图下方绘制模块，使用plotDendroAndColors()函数
# Convert numeric lables into colors 
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
sizeGrWindow(8,6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut", dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05, main = "Gene dendrogram and module colors")

Step_5:将基因表达相似的模块进行合并

#为了量化整个模块的共表达相似性，我们计算它们的特征基因，并根据它们的相关性对其进行聚类.
# Calculate eigengenes 
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes 
# Calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs)
# Cluster module eigengenes 
METree = hclust(as.dist(MEDiss), method = "average")
# Plot the result 
sizeGrWindow(7, 6) 
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")
#We choose a height cut of 0.25, corresponding to correlation of 0.75, to merge
MEDissThres = 0.25 
# Plot the cut line into the dendrogram 
abline(h=MEDissThres, col = "red")
# Call an automatic merging function 
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3) 
# The merged module colors 
mergedColors = merge$colors 
# Eigengenes of the new merged modules: 
mergedMEs = merge$newMEs
#为了查看合并模块后，网络中模块有什么变化。我们再次绘制基因树形图，并在树形图下面绘制原始模块颜色和合并后的模块颜色.
sizeGrWindow(12, 9) 
#pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), c("Dynamic Tree Cut", "Merged dynamic"), dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05) 
#dev.off()

Step_6:保存模块相关变量，用于后续的分析.需要保存的变量有①模块的特征基因②模块的数字标签③模块的颜色标签④基因的树形图。

# Rename to moduleColors 
moduleColors = mergedColors 
# Construct numerical labels corresponding to the colors
colorOrder = c("grey", standardColors(50))
moduleLabels = match(moduleColors, colorOrder)-1
MEs = mergedMEs
# Save module colors and labels for use in subsequent parts
save(MEs, moduleLabels, moduleColors, geneTree, file = "FemaleLiver-02-networkConstruction-stepByStep.RData")

6、展示模块之间的相关性

# module eigengene, 可以绘制线图，作为每个模块的基因表达趋势的展示
MEs = net$MEs

### 不需要重新计算，改下列名字就好
### 官方教程是重新计算的，起始可以不用这么麻烦
MEs_col = MEs
colnames(MEs_col) = paste0("ME", labels2colors(as.numeric(str_replace_all(colnames(MEs),"ME",""))))
MEs_col = orderMEs(MEs_col)

# 根据基因间表达量进行聚类所得到的各模块间的相关性图
# marDendro/marHeatmap 设置下、左、上、右的边距
plotEigengeneNetworks(MEs_col, "Eigengene adjacency heatmap", 
                      marDendro = c(3,3,2,4),
                      marHeatmap = c(3,4,2,2), plotDendrograms = T, 
                      xLabelsAngle = 90)

image.png

7、可视化基因网络 (TOM plot)

# 如果采用分步计算，或设置的blocksize>=总基因数，直接load计算好的TOM结果
# 否则需要再计算一遍，比较耗费时间
# TOM = TOMsimilarityFromExpr(dataExpr, power=power, corType=corType, networkType=type)
load(net$TOMFiles[1], verbose=T)

## Loading objects:
##   TOM

TOM <- as.matrix(TOM)

dissTOM = 1-TOM
# Transform dissTOM with a power to make moderately strong 
# connections more visible in the heatmap
plotTOM = dissTOM^7
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA

# Call the plot function
# 这一部分特别耗时，行列同时做层级聚类
TOMplot(plotTOM, net$dendrograms, moduleColors, 
        main = "Network heatmap plot, all genes")

8、模块和性状的关联分析
看完资料之后，性状关联分析貌似有两种处理方法。
第一种：质量性状。一列subtype但是包含有5种类型的癌症。（https://cloud.tencent.com/developer/article/1516749）

image.png

因为关联分析的时候性状必须是数值型，所以需要转换成一个矩阵。
后面就用design=model.matrix(~0+ datTraits$subtype)进行转换，得到的结果如图所示

image.png

table(datTraits$subtype) #列出形状数据中subtype的信息。
if(T){ #判断性状数据是否为空值
  nGenes = ncol(datExpr)
  nSamples = nrow(datExpr)
  design=model.matrix(~0+ datTraits$subtype) #将subtype当中的质量性状转换成0和1的数值矩阵，有几个质量性状，就是几列的矩阵。
  #myt1 <- factor(datTraits$subtype)
  #myt2 <- levels(myt1)
  colnames(design)=levels(factor(datTraits$subtype))#这一步骤是将design的矩阵的列名修改为具体的质量性状。之前网上的那个版本不对，这里修改了一下就好了。
  moduleColors <- labels2colors(net$colors)
  # Recalculate MEs with color labels
  MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes #关于module中Eigengenes的具体含义以及算法参加以下链接（http://www.sci666.net/59227.html)。 
  MEs = orderMEs(MEs0) ##不同颜色的模块的ME值矩 (样本vs模块) 。这个order完了之后可以自己View一下看看。
  moduleTraitCor = cor(MEs, design , use = "p") #模块和性状关联分析。
  moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples) #根据基因数目，计算P值。

  sizeGrWindow(10,6)
  # Will display correlations and their p-values
  textMatrix = paste(signif(moduleTraitCor, 2), "\n(",signif(moduleTraitPvalue, 1), ")", sep = "") #这里定义表上的数值的格式。上面是moduleTraitCor，保留小数后两位；回车之后括号里面是P值，保留小数后一位。
  dim(textMatrix) = dim(moduleTraitCor) #把几行几列的信息给textMatrix。
  png("Module-trait-relationships.png",width = 800,height = 1200,res = 120) #搞个画布
  par(mar = c(6, 8.5, 3, 3)) #设置一下外边距（margin size (mar)）。这四个数字的顺序为bottom, left, top, and right. 默认的参数是 c(5.1, 4.1, 4.1, 2.1)。
  # Display the correlation values within a heatmap plot
  labeledHeatmap(Matrix = moduleTraitCor,
                 xLabels = colnames(design),
                 yLabels = names(MEs),
                 ySymbols = names(MEs),
                 colorLabels = FALSE,
                 colors = greenWhiteRed(50),
                 textMatrix = textMatrix,
                 setStdMargins = FALSE,
                 cex.text = 0.5,
                 zlim = c(-1,1),
                 main = paste("Module-trait relationships"))
  dev.off()
}  ##或者你也可以不输出到png，直接在R里面显示然后保存，都可以。

image.png

除了上面的热图展现性状与基因模块的相关性外。
还可以是条形图,但是只能是指定某个性状。
或者自己循环一下批量出图。

table(datTraits$subtype)
if(T){
  nGenes = ncol(datExpr)
  nSamples = nrow(datExpr)
  design=model.matrix(~0+ datTraits$subtype)
  #myt1 <- factor(datTraits$subtype)
  #myt2 <- levels(myt1)
  colnames(design)=levels(factor(datTraits$subtype))
  moduleColors <- labels2colors(net$colors)
  # Recalculate MEs with color labels
  MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
  MEs = orderMEs(MEs0); ##不同颜色的模块的ME值矩 (样本vs模块)
  moduleTraitCor = cor(MEs, design , use = "p");
  moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)

  
  Luminal = as.data.frame(design[,1])
  names(Luminal) = "Basal"
  y=Luminal
  GS1=as.numeric(cor(y,datExpr, use="p"))
  GeneSignificance=abs(GS1)
  # Next module significance is defined as average gene significance.
  ModuleSignificance=tapply(GeneSignificance,
                            moduleColors, mean, na.rm=T)
  sizeGrWindow(8,7)
  par(mfrow = c(1,1))
  # 如果模块太多，下面的展示就不友好
  # 不过，我们可以自定义出图。
  plotModuleSignificance(GeneSignificance,moduleColors)

}

image.png

9、感兴趣性状的模块的相关性分析

后续还会持续更新。。。。

转录组WGCNA分析

推荐阅读更多精彩内容