简介
主成分分析(PCA)和偏最小二乘法(PLS)是对变量数超过样本数量或变量之间存在多重共线性的组学数据进行可视化、回归、分类和特征选择的常用方法。
PLS和正交偏最小二乘法(OPLS)是有监督的模式,它们使用偏最小二乘回归建立代谢物表达量与样本类别之间的关系模型,实现对样品类别的预测,是一种建模类型的方法 相较而言,OPLS能够分别对相关因子和不相关变异进行建模,虽然计算方式与PLS相同,但OPLS具有更强的解释性。
而且,当无监督(PCA)无法很好地区分组间样本时,PLS-DA可以实现有效分离。并且PLS-DA和OPLS-DA所构建的分类预测模型,可进一步用于识别更多的样本类别,这是探索性的PCA方法无法做到的。
另外,PLS-DA和OPLS-DA所构建的分类模型中的载荷图可用于衡量各代谢物组分对样本分类判别的影响强度和解释能力,辅助标志代谢物的筛选。 ## 实例解读
例1. 不同品种代谢产物OPLS-DA loading plot1
OPLS-DA loading plots for different mulberry cultivars
The metabolites with loadings that were distant from the origin on the OPLS-DA-loading plots were inferred to make the greatest contribution to class separation. Organic acids, lipids, and flavonols were major contributors to the separation among the three mulberry fruits.
安装2
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ropls")
内置数据介绍
library(ropls)
library(ggplot2)
# 载入内置模拟数据
data(sacurine)
该数据是通过液相色谱-高分辨率质谱(LC-HRMS)来研究年龄、体重指数(bmi)和性别对尿液中代谢物浓度的影响,是一个list具体包含:
- The dataMatrix matrix of numeric type containing the intensity profiles (log10 transformed)
- The sampleMetadata data frame containg sample metadata
- The variableMetadata data frame containg variable metadata
Urine samples were analyzed by using an LTQ Orbitrap in the negative ionization mode. A total of 109 metabolites were identified or annotated at the MSI level 1 or 2. After retention time alignment with XCMS, peaks were integrated with Quan Browser. Signal drift and batch effect were corrected, and each urine profile was normalized to the osmolality of the sample. Finally, the data were log10 transformed. - Thevenot(2015)3
代谢物对个体性别的定性响应 - PLS
由于目的是识别给定数据集的特征,而不是建模预测未知数据的分类,在这里将所有数据均作为训练集构建模型。
执行PLS-DA
# 提各指标读数
dataMatrix <- sacurine$dataMatrix
# 提取性别数据列
genderFc <- sacurine$sampleMetadata[, "gender"]
# 不指定或orthoI = 0时,执行PLS
sacurine.plsda <- opls(dataMatrix, genderFc, orthoI = 0)
## PLS-DA
## 183 samples x 109 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.275 0.73 0.584 0.262 3 0 0.05 0.05
结果参数解释
opls的计算结果中常用对象包括:
names(attributes(sacurine.plsda))
## [1] "typeC" "descriptionMC" "modelDF" "summaryDF"
## [5] "subsetVi" "pcaVarVn" "vipVn" "orthoVipVn"
## [9] "coefficientMN" "xMeanVn" "xSdVn" "yMeanVn"
## [13] "ySdVn" "xZeroVarVi" "scoreMN" "loadingMN"
## [17] "weightMN" "orthoScoreMN" "orthoLoadingMN" "orthoWeightMN"
## [21] "cMN" "uMN" "weightStarMN" "coMN"
## [25] "suppLs" "class"
- 执行模型名称(typeC):PCA, PLS, PLS-DA, OPLS, or OPLS-DA
- 数据整体描述(descriptionMC):样本数,变量,响应值数目
- 数据标准化方法
- 模型概述(modelDF):R2X, R2X(cum), R2Y, R2Y(cum), Q2, Q2(cum), pQ2, significance, iterations(查看S4)
- summaryDF:cumulated R2X, R2Y and Q2;RMSEE
- 得分矩阵(scoreMN):样本在各轴中的坐标
- 载荷矩阵(loadingMN):代谢物在各轴中的坐标
模型评估
执行PLS统计建模时,一般会同时给出4个图片:
PLS-DA model of the gender response
显著性诊断(左上):实际和模拟模型的R2Y和Q2Y值经随机排列后的散点图,模型R2Y和Q2Y(散点)大于真实值时(横线),表明产生过拟合2。
Inertia(惯量)柱形图(右上):通过展示累计解释率评估正交组分是否足够
离群点展示(左下):通过scoreMN和loadingMN计算出各样本在投影平面及正交平面的坐标,并标明相互差异较大的样本。
x-score plot(右下):各样本在PLS-DA轴中的坐标;R2X、R2Y等值展示在下方,用于评估模型优度:
- R2X和R2Y表示模型对X矩阵和Y矩阵的解释率
- Q2Y∈[0,1], Q2Y越大,模型预测性能越好。
常用可视化参数解释
-
typeVc
- “summary” [default]: permutation, overview, outlier, x-score
- “permutation”:
- ‘overview’: 展示PCA分析的累计解释率(Variance explained)
- ‘outlier’: 离群值诊断(scoreMN和loadingMN)
- “x-score”:
- ‘correlation’: 变量和各成分的相关性图
- ‘x-loading’: 贡献较大的组分被红色标记
-
parAsColFcVn
- score plot 中的着色参数
-
parEllipsesL
- 是否显示椭圆
-
fig.pdfC
- interactive’ (default):图像随opls函数运行而显示
- ‘none’:不显示图片
plot(sacurine.plsda, typeVc = 'x-loading')
ggplot2可视化 - loading plot
与其说是可视化方法,不如称为数据提取章节。
# 提取score坐标
scoreMN <- sacurine.plsda@scoreMN
scoreMN <- cbind(scoreMN, sacurine$sampleMetadata)
scoreMN$samples <- rownames(scoreMN)
# 提取loading坐标
loadingMN <- sacurine.plsda@loadingMN
# 解释率
x_lab <- sacurine.plsda@modelDF[1, "R2X"] * 100
y_lab <- sacurine.plsda@modelDF[2, "R2X"] * 100
ggplot(data = data.frame(loadingMN)) + geom_point(aes(p1, p2)) +
labs(x = paste0("p1(", x_lab, ")"), y = paste0("p2(", y_lab, ")"))
变量投影重要度(VIP)
通过变量投影重要度(Variable Importance for the Projection,VIP),可以衡量各代谢物组分含量对样本分类判别的影响强度和解释能力,辅助标志代谢物的筛选(阈值通常设为1)。
vipVn <- sacurine.plsda@vipVn # getVipVn()
vipVn_select <- vipVn[vipVn > 1]
# cbind Metadata and vipVn
vipVn_select <- cbind(sacurine$variableMetadata[names(vipVn_select), ], vipVn_select)
# rename
names(vipVn_select)[4] <- 'VIP'
# reorder
vipVn_select <- vipVn_select[order(vipVn_select$VIP, decreasing = TRUE), ]
# plot(sacurine.plsda, typeVc = "x-loading") 展示前10个
head(vipVn_select)
## msiLevel hmdb chemicalClass VIP
## Testosterone glucuronide 2 HMDB03193 Lipids:Steroi 2.369640
## Malic acid 1 HMDB00156 Organi 2.076340
## p-Anisic acid 1 HMDB01101 AroHoM 2.075301
## Pantothenic acid 1 HMDB00210 AliAcy 1.878811
## Oxoglutaric acid 2 HMDB00208 Organi 1.750144
## Acetylphenylalanine 1 HMDB00512 AA-pep 1.581015
代谢物对个体性别的定性响应 - OPLS
Orthogonal partial least squares(OPLS) 将观测值矩阵X的差异分为两个部分:第一部分代表与Y相关的差异,第二部分代表与Y不相关(正交垂直)的差异,结果展示时需要结合起来讨论;由于OPLS区分了无关变量数据,从而使模型更加容易解读。
另外,OPLS可以更好地避免过拟合现象,预测性能优势并没有明显提升;因此,如果PLS-DA模型尚可:“summary”的4个plot的结果比较好,仍推荐使用PLS-DA。
执行OPLS
# 通过orthoI指定正交组分数目
# orthoI = NA时,执行OPLS,并通过交叉验证自动计算适合的正交组分数
# predI 响应值组数:默认设定为1(for OPLS)
sacurine.oplsda <- opls(dataMatrix, genderFc, predI = 1, orthoI = NA)
## OPLS-DA
## 183 samples x 109 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.275 0.73 0.602 0.262 1 2 0.05 0.05
# plot(sacurine.oplsda, typeVc = 'x-loading')
数据提取
执行OPLS后的数据提取,与PLS和PCA略有不同,需要同时考虑得分矩阵和正交矩阵。
- Score矩阵
scoreMN <- cbind(sacurine.oplsda@scoreMN[, 1], sacurine.oplsda@orthoScoreMN[, 1])
# rename
colnames(scoreMN) <- c("h1", paste0("o", 1))
x_lab <- paste0("t1(", sacurine.oplsda@modelDF[1, "R2X"] * 100, "%)")
y_lab <- "to1"
head(scoreMN)
## h1 o1
## HU_011 -1.582933 -4.9806037
## HU_014 1.372806 -1.7443382
## HU_015 -3.341370 -3.4372771
## HU_017 -3.590063 -0.9794960
## HU_018 -1.662716 0.3155845
## HU_019 -2.312923 0.6561281
- Loading矩阵
loadMN <- cbind(sacurine.oplsda@loadingMN[, 1], sacurine.oplsda@orthoLoadingMN[, 1])
# rename
colnames(loadMN) <- c("h1", paste0("o", 1))
x_lab <- sacurine.plsda@modelDF[1, "R2X"] * 100
y_lab <- paste0("pOrtho1(", sacurine.oplsda@modelDF[1, "R2X"] * 100, "%)")
head(loadMN)
## h1 o1
## (2-methoxyethoxy)propanoic acid isomer 0.062923148 0.01065787
## (gamma)Glu-Leu/Ile -0.119757449 0.11419759
## 1-Methyluric acid -0.058040188 0.18668459
## 1-Methylxanthine -0.045780287 0.16374339
## 1,3-Dimethyluric acid -0.061941784 0.14526897
## 1,7-Dimethyluric acid -0.003216022 0.14375613
Overfitting
过度拟合(Overfitting)是当机器学习应用于具有比样本更多变量的数据集的主要问题;前期随机数实验表明:当变量的数量超过样本的数量时,可以实现完美的PLS-DA分类。而,当样本数量超过观测的数量时,PLS过度拟合可能发生。因此,有必要通过标签的随机排列来检查模型的Q2Y值是否显著。
reference
1. Li, H. et al. Abnormal expression of bHLH3 disrupts a flavonoid homeostasis network, causing differences in pigment composition among mulberry fruits. Hortic Res 7, 83 (2020).
2. Thevenot, E. A., Roux, A., Xu, Y., Ezan, E. & Junot, C. Analysis of the human adult urinary metabolome variations with age, body mass index and gender by implementing a comprehensive workflow for univariate and opls statistical analyses. Journal of Proteome Research 14, 3322–3335 (2015).
3. Thévenot, E. A., Roux, A., Xu, Y., Ezan, E. & Junot, C. Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and opls statistical analyses. 14, 3322–3335 (2015).