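I recently got hold of an untargeted metabolomics dataset and am using the MetaboDiff package to learn the logic and workflow of metabolomics analysis. The workflow below follows the official MetaboDiff documentation.
1. Installing the MetaboDiff package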
library("devtools")
install_github("andreasmock/MetaboDiff")
library(MetaboDiff)
2. Data processing
2.1 Data import
MetaboDiff requires three inputs:
- assay - a matrix of relative metabolite abundances;
- rowData - a data frame of metabolite annotations;
- colData - a data frame of sample metadata.
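To make the shapes of the three inputs concrete, here is a minimal base R sketch with toy data. The object names (`toy_assay`, `toy_rowData`, `toy_colData`) and all values are made up for illustration. The consistency check assumes, as is usual for SummarizedExperiment-style objects, that assay rows correspond to rowData rows and assay columns to colData rows.

```r
# Toy inputs (hypothetical values), mirroring the three tables described above.
toy_assay <- matrix(
  c(10, 20, NA,
    40, 50, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("met1", "met2"), c("pat1", "pat2", "pat3"))
)
toy_rowData <- data.frame(
  BIOCHEMICAL = c("metabolite A", "metabolite B"),
  KEGG_ID = c(NA, "C00001"),
  row.names = c("met1", "met2")
)
toy_colData <- data.frame(
  id = c("s1", "s2", "s3"),
  tumor_normal = c("N", "N", "T"),
  row.names = c("pat1", "pat2", "pat3")
)

# Assumption: assay rows line up with rowData, assay columns with colData.
rows_ok <- identical(rownames(toy_assay), rownames(toy_rowData))
cols_ok <- identical(colnames(toy_assay), rownames(toy_colData))
```

Running a check like this on your own tables before building the combined object can catch mismatched or reordered sample names early.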
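The example data shipped with MetaboDiff come from the paper "AKT1 and MYC Induce Distinctive Metabolic Fingerprints in Human Prostate Cancer". The metabolomics data were measured on prostate tissue from 61 prostate cancer patients and 25 normal controls.
First, take a look at the three tables.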
> assay[1:5,1:5]
         pat1      pat2      pat3     pat4      pat5
met1 33964.73 117318.43 118856.90  78670.7 102565.94
met2 18505.56 167585.32  59621.97  66220.4  74892.27
met3       NA  42373.93  27141.21       NA  38390.78
met4 61638.77  74595.78        NA       NA        NA
met5       NA 148363.61  43861.79 105835.2  25589.08
> head(colData)
       id tumor_normal random_gender   group
pat1  cp2            N        female Control
pat2  cp7            N        female Control
pat3 cp19            N          male Control
pat4 cp26            N          male Control
pat5 cp29            N        female Control
pat6 cp32            N          male Control
> head(rowData)
                                    BIOCHEMICAL    SUPER_PATHWAY      SUB_PATHWAY METABOLON_ID
met1  1-arachidonoylglycerophosphoethanolamine*            Lipid        Lysolipid        35186
met2      1-arachidonoylglycerophosphoinositol*            Lipid        Lysolipid        34214
met3                      1-arachidonylglycerol            Lipid Monoacylglycerol        34397
met4      1-eicosadienoylglycerophosphocholine*            Lipid        Lysolipid        33871
met5 1-heptadecanoylglycerophosphoethanolamine* No Super Pathway       No Pathway        37419
met6       1-linoleoylglycerol (1-monolinolein)            Lipid Monoacylglycerol        27447
      PLATFORM KEGG_ID   HMDB_ID
met1 LC/MS neg    <NA> HMDB11517
met2 LC/MS neg    <NA>      <NA>
met3 LC/MS neg  C13857 HMDB11572
met4 LC/MS pos    <NA>      <NA>
met5 LC/MS neg    <NA>      <NA>
met6 LC/MS neg    <NA>      <NA>
# Combine the three tables into a single object for downstream analysis.
> (met <- create_mae(assay,rowData,colData))
A MultiAssayExperiment object of 1 listed
experiment with a user-defined name and respective class.
Containing an ExperimentList class object of length 1:
[1] raw: SummarizedExperiment with 307 rows and 86 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
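2.2 Metabolite annotation
If HMDB, KEGG or ChEBI IDs are part of rowData, metabolite annotations can be retrieved from the Small Molecule Pathway Database (SMPDB). The column_*_id arguments give the positions of the ID columns in rowData; here KEGG_ID is column 6 and HMDB_ID is column 7, and there is no ChEBI column.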
> met <- get_SMPDBanno(met,
+ column_kegg_id=6,
+ column_hmdb_id=7,
+ column_chebi_id=NA)
2.3 Handling missing values
> na_heatmap(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue"))
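# Remove metabolites with missing values in more than 40% of samples (cutoff=0.4), then impute the remaining missing values by k-nearest neighbors.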
> (met = knn_impute(met,cutoff=0.4))
A MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] raw: SummarizedExperiment with 307 rows and 86 columns
[2] imputed: SummarizedExperiment with 238 rows and 86 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
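The filtering half of this step can be sketched in base R. This is only an illustration of the missingness cutoff on a toy matrix, not the package's implementation; the variable names are hypothetical and the k-nearest-neighbor imputation itself is not reproduced here.

```r
# Toy matrix: rows are metabolites, columns are samples (values are made up).
x <- matrix(
  c( 1,  2,  3,  4,  5,
    NA, NA, NA,  4,  5,
    NA,  2,  3,  4,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("met1", "met2", "met3"), paste0("pat", 1:5))
)

cutoff <- 0.4

# Fraction of samples in which each metabolite is missing.
na_fraction <- rowMeans(is.na(x))

# Keep metabolites whose missing fraction does not exceed the cutoff.
kept <- x[na_fraction <= cutoff, , drop = FALSE]
```

Here met2 is missing in 3 of 5 samples (60%) and is dropped, while met3 (20% missing) is kept and would be passed on to imputation. In the real data the same kind of filter reduces 307 metabolites to 238.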
2.4 Outlier heatmap
Before normalizing the data, outlier samples need to be removed.
> outlier_heatmap(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue"),
+ k=2)
With k=2, the heatmap above splits the samples into cluster 1 and cluster 2. Cluster 1 stands out from cluster 2 as the outlier group, so we remove cluster 1.
> (met <- remove_cluster(met,cluster=1))
harmonizing input:
removing 5 sampleMap rows with 'colname' not in colnames of experiments
harmonizing input:
removing 5 sampleMap rows with 'colname' not in colnames of experiments
removing 5 colData rownames not in sampleMap 'primary'
A MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] raw: SummarizedExperiment with 307 rows and 81 columns
[2] imputed: SummarizedExperiment with 238 rows and 81 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
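The idea of splitting samples into k clusters and dropping one of them can be sketched with base R hierarchical clustering. This is an assumption-laden illustration on simulated data: it is not necessarily how outlier_heatmap computes its clusters, and dropping the smaller cluster is a choice made for this toy example only.

```r
# Toy data: 6 samples, the last 2 shifted far away (hypothetical outliers).
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(paste0("met", 1:20), paste0("pat", 1:6)))
x[, 5:6] <- x[, 5:6] + 10

# Cluster samples (columns) and cut the tree into k = 2 groups.
tree <- hclust(dist(t(x)))
cluster <- cutree(tree, k = 2)

# Drop the samples that fall into the smaller cluster.
outlier_cluster <- as.integer(names(which.min(table(cluster))))
x_clean <- x[, cluster != outlier_cluster, drop = FALSE]
```

In practice the decision about which cluster is the outlier should come from inspecting the heatmap, as done above, rather than from cluster size alone.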
2.5 Normalization
> (met <- normalize_met(met))
vsn2: 307 x 81 matrix (1 stratum).
Please use 'meanSdPlot' to verify the fit.
vsn2: 238 x 81 matrix (1 stratum).
Please use 'meanSdPlot' to verify the fit.
A MultiAssayExperiment object of 4 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 4:
[1] raw: SummarizedExperiment with 307 rows and 81 columns
[2] imputed: SummarizedExperiment with 238 rows and 81 columns
[3] norm: SummarizedExperiment with 307 rows and 81 columns
[4] norm_imputed: SummarizedExperiment with 238 rows and 81 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
2.6 Quality control of the normalization
> quality_plot(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue"))
harmonizing input:
removing 243 sampleMap rows not in names(experiments)
harmonizing input:
removing 243 sampleMap rows not in names(experiments)
harmonizing input:
removing 243 sampleMap rows not in names(experiments)
harmonizing input:
removing 243 sampleMap rows not in names(experiments)
Warning messages:
1: Removed 5356 rows containing non-finite values (stat_boxplot).
2: Removed 5356 rows containing non-finite values (stat_boxplot).
3. Data analysis
3.1 Unsupervised analysis
MetaboDiff provides PCA, a linear dimensionality reduction method, and tSNE, a nonlinear one.
> source("http://peterhaschke.com/Code/multiplot.R")
> multiplot(
+ pca_plot(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue")),
+ tsne_plot(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue")),
+ cols=2)
sigma summary: Min. : 0.486945518988849 |1st Qu. : 0.714292832194587 |Median : 0.752934663223126 |Mean : 0.75914557339073 |3rd Qu. : 0.808081774279559 |Max. : 0.939549187337462 |
Epoch: Iteration #100 error is: 18.6145995899728
Epoch: Iteration #200 error is: 1.54407709770312
Epoch: Iteration #300 error is: 1.22290267643501
Epoch: Iteration #400 error is: 1.11106327484334
Epoch: Iteration #500 error is: 1.03658104678225
Epoch: Iteration #600 error is: 0.976566767973725
Epoch: Iteration #700 error is: 0.951849496540308
Epoch: Iteration #800 error is: 0.93612964053674
Epoch: Iteration #900 error is: 0.914421902208305
Epoch: Iteration #1000 error is: 0.88283039690459
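For intuition about what the PCA panel shows, here is a minimal base R sketch using prcomp on simulated data. The matrix, the group labels and the size of the group effect are all made up, and this is not the code behind pca_plot.

```r
set.seed(1)
# Toy complete matrix: 30 metabolites x 8 samples (hypothetical values).
x <- matrix(rnorm(30 * 8), nrow = 30,
            dimnames = list(paste0("met", 1:30), paste0("pat", 1:8)))
group <- rep(c("N", "T"), each = 4)

# Add a group effect to the first 10 metabolites in the "T" samples.
x[1:10, group == "T"] <- x[1:10, group == "T"] + 3

# prcomp expects observations in rows, so samples go in rows.
pca <- prcomp(t(x), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The `scores` matrix can be passed to plot() and colored by `group` to get a picture comparable to the PCA panel. PCA needs a complete matrix, so it has to be run on data without missing values.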
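3.2 Hypothesis testing
Differential analysis of individual metabolites is done mainly with the t-test and ANOVA. Both grouping factors used here have two levels, so the results below are t-tests.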
> met = diff_test(met,
+ group_factors = c("tumor_normal","random_gender"))
> str(metadata(met), max.level=2)
List of 2
$ ttest_tumor_normal_T_vs_N :'data.frame': 238 obs. of 3 variables:
..$ pval : num [1:238] 0.0206 0.7808 0.0832 0.0432 0.5859 ...
..$ adj_pval : num [1:238] 0.102 0.904 0.221 0.158 0.758 ...
..$ fold_change: num [1:238] 0.2872 0.0366 -0.3936 -0.5391 -0.1646 ...
$ ttest_random_gender_male_vs_female:'data.frame': 238 obs. of 3 variables:
..$ pval : num [1:238] 0.2318 0.8626 0.4048 0.0121 0.2111 ...
..$ adj_pval : num [1:238] 0.83 0.959 0.862 0.386 0.83 ...
..$ fold_change: num [1:238] -0.1372 -0.0208 0.1742 0.607 0.3438 ...
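The three result columns can be reproduced in spirit with base R. The sketch below runs a t-test per metabolite on simulated data, adjusts the p-values, and computes a difference in group means. It makes several assumptions: the data are made up, t.test defaults to the Welch variant, the Benjamini-Hochberg adjustment is my choice for illustration, and the difference of means stands in for `fold_change`. None of this is taken from the diff_test source.

```r
set.seed(1)
# Toy normalized matrix: 5 metabolites x 10 samples (hypothetical values).
x <- matrix(rnorm(5 * 10), nrow = 5,
            dimnames = list(paste0("met", 1:5), paste0("pat", 1:10)))
group <- factor(rep(c("N", "T"), each = 5))

# Make met1 clearly higher in the "T" group.
x["met1", group == "T"] <- x["met1", group == "T"] + 5

# One t-test per metabolite (row), T vs N.
pval <- apply(x, 1, function(v) {
  t.test(v[group == "T"], v[group == "N"])$p.value
})

res <- data.frame(
  pval = pval,
  adj_pval = p.adjust(pval, method = "BH"),  # multiple-testing correction
  fold_change = rowMeans(x[, group == "T"]) - rowMeans(x[, group == "N"])
)
```

Adjusted p-values are never smaller than the raw ones, which is why volcano plots drawn with p_adjust = TRUE usually highlight fewer metabolites than those drawn with p_adjust = FALSE.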
# Differential analysis between tumor and normal samples
> volcano_plot(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue"),
+ p_adjust = FALSE)
> volcano_plot(met,
+ group_factor="tumor_normal",
+ label_colors=c("darkseagreen","dodgerblue"),
+ p_adjust = TRUE)
# Differential analysis between female and male samples
> par(mfrow=c(1,2))
> volcano_plot(met,
+ group_factor="random_gender",
+ label_colors=c("brown","orange"),
+ p_adjust = FALSE)
> volcano_plot(met,
+ group_factor="random_gender",
+ label_colors=c("brown","orange"),
+ p_adjust = TRUE)
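3.3 Metabolic correlation network analysis
Correlation analysis has been used successfully in comparative transcriptomics to reveal changes in biologically meaningful modules. The same idea can be applied to metabolomics data.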
> met_example <- met_example %>%
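+ diss_matrix %>% # build the dissimilarity matrix
+ identify_modules(min_module_size=5) %>% # identify metabolic correlation modules
+ name_modules(pathway_annotation="SUB_PATHWAY") %>% # name the modules
+ calculate_MS(group_factors=c("tumor_normal","random_gender")) # compute module significance, i.e. each module's association with the sample traits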
alpha: 1.000000
..cutHeight not given, setting it to 0.991 ===> 99% of the (truncated) height range in dendro.
..done.
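The core idea behind the module step, grouping metabolites whose abundances are correlated across samples, can be sketched in base R. This is a deliberately simplified stand-in on simulated data: it uses 1 - |correlation| as the dissimilarity and a fixed number of clusters, whereas WGCNA-style pipelines typically use a more elaborate dissimilarity and a dynamic tree cut. All names and values are hypothetical.

```r
set.seed(1)
n_samples <- 20

# Two latent signals drive two groups of metabolites (hypothetical data).
s1 <- rnorm(n_samples)
s2 <- rnorm(n_samples)
x <- rbind(
  t(sapply(1:5, function(i) s1 + rnorm(n_samples, sd = 0.1))),
  t(sapply(1:5, function(i) s2 + rnorm(n_samples, sd = 0.1)))
)
rownames(x) <- paste0("met", 1:10)

# Dissimilarity between metabolites: 1 - |Pearson correlation|.
diss <- as.dist(1 - abs(cor(t(x))))

# Hierarchical clustering, then cut the tree into two modules.
tree <- hclust(diss, method = "average")
modules <- cutree(tree, k = 2)
```

Metabolites met1 to met5 end up in one module and met6 to met10 in the other, because each set follows its own latent signal.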
# Visualize the modules: hierarchical clustering dendrogram
> WGCNA::plotDendroAndColors(metadata(met_example)$tree,
+ metadata(met_example)$module_color_vector,
+ 'Module colors',
+ dendroLabels = FALSE,
+ hang = 0.03,
+ addGuide = TRUE,
+ guideHang = 0.05, main='')
# Visualize the modules: relationships between modules
> par(mar=c(2,2,2,2))
> ape::plot.phylo(ape::as.phylo(metadata(met_example)$METree),
+ type = 'fan',
+ show.tip.label = FALSE,
+ main='')
> ape::tiplabels(frame = 'circle',
+ col='black',
+ text=rep('',length(unique(metadata(met_example)$modules))),
+ bg = WGCNA::labels2colors(0:21))
# Visualize the named modules
> ape::plot.phylo(ape::as.phylo(metadata(met_example)$METree), cex=0.9)
# Visualize module significance for tumor vs. normal samples
> MS_plot(met_example,
+ group_factor="tumor_normal",
+ p_value_cutoff=0.05,
+ p_adjust=FALSE)
# Visualize module significance for male vs. female samples
> MS_plot(met_example,
+ group_factor="random_gender",
+ p_value_cutoff=0.05,
+ p_adjust=FALSE)
# Test individual metabolites within a module of interest (MOI) for differences between sample groups
> MOI_plot(met_example,
+ group_factor="tumor_normal",
+ MOI = 2,
+ label_colors=c("darkseagreen","dodgerblue"),
+ p_adjust = FALSE) + xlim(c(-1,8))