关于16S扩增子的分析流程，深圳基因组所的刘永鑫老师已编写了十分详尽的教程，让人人都能上手16S分析，而代谢组分析结果一般都由送样公司直接交付。通过16S+代谢组学的联合分析，我们就能将特定的微生物与特定的功能变化直接联系起来，从而提出“因果假设”。

微生物组（Who is there?） + 代谢组（What are they doing?） = 揭示功能机制

而如何进行微生物-代谢组联合分析，网上相关详细教程似乎比较少。最近阅读了湖南省农科院苏品老师团队在Microbiome发表的Host metabolites explain microbiome variation between different rice genotypes文章，对水稻叶际微生物和代谢物进行了联合分析，发现苯丙素类化合物解释了基因型之间的水稻叶际微生物组变异。文章给出了详细的分析代码（包括输入文件，详见链接），下面就以这些代码为基础，走一遍基础分析流程吧。

一、输入文件准备

16S+代谢组联合分析需要准备两个输入文件，一个是16S的分析结果表格，例如文章里给出的RA.csv文件，每一列表示样本名称，每一行表示微生物名称。（这里使用的是目水平的结果进行分析，如果想使用科、属或其他水平的进行分析都行，我使用excel直接打开文件，导致有些字符显示不兼容）

RA.csv

第二个文件是代谢组分析文件，例如文章里给出的metadata.csv，第一列表示样本名称（应与RA.csv中的样本名称对应），红框中其他几列表示6中代谢物的定量结果（为什么只测这6种，因为这6种物质都属于苯丙素类化合物，具体可详细阅读论文）。

metadata.csv

二、数据预处理

后续整个分析及可视化流程都基于R实现。首先设置工作目录，加载对应R包（未安装对应R包的请手动安装），读取两个输入文件。

setwd("yourpath/microbiome-analysis-code-main/microbiome-analysis-code-main/m-m analysis_data")

library(vegan)

library(ggplot2)

library(tibble)

library(ggpubr)

taxonomy=read.csv("RA.csv", header=T, row.names=1, comment.char="", stringsAsFactors=F)

metadata=read.csv("metadata.csv", header=T, row.names=1, comment.char="", stringsAsFactors=F)

接下来对读取的文件去除缺失值、对齐样本ID、去除全为零的微生物行，最终得到预处理后的metadata和taxonomy表格。

taxonomy[is.na(taxonomy)]<-0

idx = rownames(metadata) %in% colnames(taxonomy)

metadata = metadata[idx, , drop = F]

taxonomy = taxonomy[rowSums(taxonomy)!=0, rownames(metadata)]

定义颜色向量，用于后续绘图中的颜色映射（可选）

colorsF <-c(Pseudomonadales="#F4BD27",Enterobacteriales="#55C1D7",Bacteroidales="#7BCBDF",

Burkholderiales="#B3C84D",Xanthomonadales="#4DFAA1",Rhizobiales="#EEEAB4",Sphingomonadales="#FADB75",

Sphingobacteriales="#A3F4B7",Flavobacteriales="#BAA5E6",

Rhodospirillales="#BDD2C7",Other="#FFFFFF",

HCA="#007808",FLA="#00AC0B",LPA="#009A3E",CFA="#13ED6A",SPLT="#38DF8E",SPA="#1ADFA4")

三、选择分析模型：RDA vs CCA

CCA（典型相关分析）是研究两组变量之间关系的一种多变量统计分析方法，它可以反映两组变量之间的相互依赖的线性关系。RDA（冗余分析）是一种结合回归和主成分分析的方法，是多元回归分析的直接扩展，用于多变量响应数据建模。关于两种方法的具体异同及分析结果解读详见链接1和链接2。

使用decorana判断梯度长度（DCA分析），根据第一轴长度决定使用RDA还是CCA。如果大于4.0,就应选CCA；如果在3.0-4.0之间，选RDA和CCA均可；如果小于3.0, RDA的结果会更合理。

otu.helli=decostand(t(taxonomy),method = "hellinger")

decorana(otu.helli)

上述命令输出结果为2.2561，因此我们后续采用RDA分析。

decorana(otu.helli)输出结果

四、代谢物对微生物组的解释度分析

使用所有6种代谢物作为解释变量，微生物组成作为响应变量；并使用envfit检验每个代谢物与微生物组成的相关性强度和显著性，使用999次置换检验计算p值。并提取结果，$vectors$r: R²值，表示解释度；$vectors$pvals: p值，表示显著性。可以看到输出的数据框中，除了SPA之外的其他5种代谢物P值都很显著，其中HCA的R²值最高，这表明可能HCA这种物质对于微生物组的解释贡献度最大。

otu_rda_all <- rda(t(taxonomy) ~ LPA+HCA+FLA+CFA+SPLT+SPA, data = sampFile)

otu_rda_all_envfit <- envfit(otu_rda_all,sampFile,permu=999)

cor_metabo <- cbind(as.data.frame(otu_rda_all_envfit$vectors$r),as.data.frame(otu_rda_all_envfit$vectors$pvals))

colnames(cor_metabo) = c("r", "p")

cor_metabo <-rownames_to_column(cor_metabo, var = "metabo")

cor_metabo数据框

使用cor_metabo数据框绘制条形图，并命名为p14。

p14 <- ggplot(cor_metabo,aes(fill=metabo, y = r,x =reorder(metabo,-r)))+

geom_bar(color="#BAB6B6",stat = 'identity', width = 0.9,size=1)+

geom_text(aes(y = r+0.05, label =paste("p =",p, sep = ""),colour=metabo) ,size = 4, fontface = "bold") +

labs(x = '', y = '')+

xlab("Metabolite factor")+

ylab(expression(R^"2"))+

theme_classic()+

scale_color_manual(values=colorsF)+

scale_fill_manual(values=colorsF)

p14

接下来计算所有代谢物共同解释的方差比例，并进行显著性检验。可以看到计算出来总方差值variance_all为0.3564，也就是说代谢物对微生物组的解释比例为0.3564，对RDA模型进行置换检验，判断模型是否显著，计算得到的p.val为0.00099。

variance_all = (otu_rda_all$CCA$tot.chi/otu_rda_all$tot.chi)

perm_otu_rda = anova.cca(otu_rda_all, permutations = 1000, parallel = 4)

p.val = perm_otu_rda[1, 4]

提取RDA模型的缩放结果（Scaling 1），提取特征值（Eigenvalues），concont表示约束性部分的贡献度，eig1存储每个RDA轴的解释方差百分比。可以看到RDA1已经解释了99.42%的总方差。

rda.scaling1 <- summary(otu_rda_all, scaling = 1)

eig1 = rda.scaling1$concont$importance

eig1

接下来提取样本坐标并合并代谢物和微生物数据，提取两个特定微生物目Pseudomonadales 和 Xanthomonadales的丰度数据。将样本坐标、代谢物数据和微生物丰度数据合并到一个数据框中。（为什么提取这两个特定的微生物，因为在作者之前的文章中Pseudomonadales 和 Xanthomonadales 被鉴定为与这组代谢物相关性最强、最显著的两个微生物类群）

points <- as.data.frame(scores(otu_rda_all, display = "sites"))

tax<-c("Pseudomonadales","Xanthomonadales")

tax1<-t(taxonomy)

tax1<-as.data.frame(tax1[, tax], row.names = row.names(tax1))

sampFiletax =cbind(sampFile,tax1)

points = cbind(sampFiletax, points)

分别绘制代谢物与两个特定微生物目的RDA图（p10和p9），点的颜色表示对应微生物的相对丰度。通过p10我们可以看到所有Pseudomonadales丰度高的样本都集中在RDA图某个特定区域，这暗示了该微生物与驱动样本分异的代谢物之间存在强关联。而在Xanthomonadales中（p9）则不存在这种结果。

p10 = ggplot(points) +

geom_point(aes(x = RDA1 , y = RDA2, fill=Pseudomonadales),shape=21,alpha = 0.8, size =4, stroke = 0.1,color = "#BAB6B6") +

labs(x = paste("RDA 1 (", format(100 * eig1[2,1], digits = 4), "%)", sep = ""),

y = paste("RDA 2 (", format(100 * eig1[2,2], digits = 4), "%)", sep = ""), color = "black") +

ggtitle(paste(format(100 * variance_all, digits = 3), " % of variance; P = ", format(p.val, digits = 2), sep = "")) +

theme_classic() +

theme(text = element_text(family = "sans", size = 7))+

scale_fill_gradient2(low="#B5CBE2",high="#F4BD27",mid="#FBFBFB",midpoint=40)

p9 = ggplot(points) +

geom_point(aes(x = RDA1 , y = RDA2, fill=Xanthomonadales),shape=21,alpha = 0.8, size =4, stroke = 0.1,color = "#BAB6B6") +

labs(x = paste("RDA 1 (", format(100 * eig1[2,1], digits = 4), "%)", sep = ""),

y = paste("RDA 2 (", format(100 * eig1[2,2], digits = 4), "%)", sep = ""), color = "black") +

ggtitle(paste(format(100 * variance_all, digits = 3), " % of variance; P = ", format(p.val, digits = 2), sep = "")) +

theme_classic() +

theme(text = element_text(family = "sans", size = 7))+

scale_fill_gradient2(low="#B5CBE2",high="#F4BD27",mid="#FBFBFB",midpoint=40)

p10

根据p10展示的结果，提出疑问：是哪种代谢物驱动了Pseudomonadales的丰度变化呢？于是继续从RDA结果中提取解释变量（代谢物）的坐标进行分析并可视化。首先提取所有代谢物坐标，再按R²值降序排列代谢物，取前6个进行分析，得到out_rda_env2数据框，展示了各个代谢物的坐标。

otu_rda_env = as.data.frame(scores(otu_rda_all, display = "bp"))

cor_metabo2<-cor_metabo[order(cor_metabo$r,decreasing = TRUE),]

cor_metabo2<-cor_metabo2[1:6,]

otu_rda_env2<-otu_rda_env[cor_metabo2$metabo,]

otu_rda_env2 <-rownames_to_column(otu_rda_env2, var = "metabo")

out_rda_env2

接下来，以p10为基础，将几种代谢物的坐标添加上去，并添加微生物箭头和标签，得到p11。

p11<-p10+geom_segment(data=otu_rda_env2,aes(x=0,y=0,xend=RDA1*4,yend=RDA2*4,colour=metabo),

,size=0.2,linetype=1,arrow=arrow(angle = 35,length=unit(0.3,"cm"))) +

geom_text(data=otu_rda_env2,aes(x=RDA1*4,y=RDA2*4,label=metabo,colour=metabo),size=3.5,

hjust=(1-sign(otu_rda_env2$RDA1))/2,angle=(180/pi)*atan(otu_rda_env2$RDA2/otu_rda_env2$RDA1))+

geom_vline(xintercept = 0,lty="dashed",color = 'black', size = 0.2)+

geom_hline(yintercept = 0,lty="dashed",color = 'black', size = 0.2)+

scale_color_manual(values=colorsF)+

theme_classic()

p11

五、响应代谢物的微生物变量分析

进行反向RDA，以代谢物作为响应变量，以微生物作为解释变量。并使用envfit检验每个微生物与代谢物含量的相关性强度和显著性，使用999次置换检验计算p值。从输出的数据框otu_cca_envfit中可以看到每种微生物对代谢物组的解释程度（由于微生物种类较多，截图只展示了部分微生物结果）。

rda<-rda(sampFile,t(taxonomy))

otu_cca_envfit<-envfit(rda,t(taxonomy),permu=999)

otu_cca_envfit

提取显著相关的微生物，具体规则是首先提取每个微生物与代谢物的R²和p值，再筛选p < 0.05的显著微生物，按R²降序排列，取前10个。从得到的cor_com3数据框中可以看到这10种微生物，TOP1是Pseudomonadales，与之前的结果互相验证。

cor_com1 <- cbind(as.data.frame(otu_cca_envfit$vectors$r),as.data.frame(otu_cca_envfit$vectors$pvals))

colnames(cor_com1) = c("r", "p")

cor_com1 <-rownames_to_column(cor_com1, var = "microbiome")

cor_com2 <- cor_com1[cor_com1$p<0.05,]

cor_com3<-cor_com2[order(cor_com2$r,decreasing = TRUE),]

cor_com3<-cor_com3[1:10,]

cor_com3

提取TOP3微生物坐标，otu_rda_species中可以看到三种微生物的坐标。

cor_comTAX<-cor_com3[1:3,]

taxname<-cor_comTAX$microbiome

otu_rda_species <- as.data.frame(scores(otu_rda_all, display = "species", scaling = 2))

otu_rda_species = as.data.frame(otu_rda_species[taxname,])

otu_rda_species <-rownames_to_column(otu_rda_species, var = "microbiome")

otu_rda_species

以p11为基础，将三种微生物的坐标添加上去，并添加箭头和标签，得到p12。

p12<-p11+geom_segment(data=otu_rda_species,aes(x=0,y=0,xend=RDA1/12,yend=RDA2/12,colour=microbiome)

,size=0.2,linetype=1,arrow=arrow(angle = 35,length=unit(0.3,"cm"))) +

geom_text(data=otu_rda_species,aes(x=RDA1/12,y=RDA2/12,label=microbiome,colour=microbiome),size=3.5,

hjust=(1-sign(otu_rda_species$RDA1))/2,angle=(180/pi)*atan(otu_rda_species$RDA2/otu_rda_species$RDA1))+

theme_classic()

绘制TOP10微生物与代谢物相关性的条形图

p13 <- ggplot(cor_com3,aes(fill=microbiome, x = r,y =reorder(microbiome,r)))+

geom_bar(color= "#BAB6B6",stat = 'identity', width = 0.9,size=1)+

geom_text(aes(x = r+0.1, label =paste("p =",p, sep = ""),colour=microbiome) ,size = 5, fontface = "bold") +

labs(x = '', y = '')+

ylab("Microbiome factor")+

xlab(expression(R^"2"))+

theme_classic()+

scale_color_manual(values=colorsF)+

scale_fill_manual(values=colorsF)

p13

六、单一代谢物对微生物组的解释度

以HCA为例，进行单一代谢物对微生物组解释度分析。步骤与“四、代谢物对微生物组的解释度分析”中类似。先计算HCA解释的方差比例，再进行置换检验，提取P值。可以看到variance_indiv为0.2420，P值为0.00099。

indiv_rda_all <- rda(t(taxonomy) ~ HCA, data = sampFile)

variance_indiv = (indiv_rda_all$CCA$tot.chi/indiv_rda_all$tot.chi)

perm_indiv_rda = anova.cca(indiv_rda_all, permutations = 1000, parallel = 4)

p.val2 = perm_indiv_rda[1, 4]

提取RDA模型的缩放结果（Scaling 1），提取特征值（Eigenvalues），concont表示约束性部分的贡献度，eig2存储每个RDA轴的解释方差百分比。可以看到RDA1已经解释了24.2%的总方差，PC1解释了49.34%总方差。

indiv_rda.scaling1 <- summary(indiv_rda_all, scaling = 1)

eig2 = indiv_rda.scaling1$cont$importance

eig2

提取每个样本的坐标，绘制RDA 图，点的颜色表示代谢物的浓度。

points2 = as.data.frame(indiv_rda.scaling1$sites)

tax<-c("Pseudomonadales","Xanthomonadales")

tax1<-t(taxonomy)

tax1<-as.data.frame(tax1[, tax], row.names = row.names(tax1))

sampFiletax =cbind(sampFile,tax1)

points2 = cbind(sampFiletax, points2)

p15 = ggplot(points2) +

geom_point(aes(x = RDA1 , y = PC1, fill=HCA),shape=21,alpha = 0.8, size =4, stroke = 0.1, color = "#BAB6B6") +

labs(x = paste("RDA 1 (", format(100 * eig2[2,1], digits = 4), "%)", sep = ""),

y = paste("PC 1 (", format(100 * eig2[2,2], digits = 4), "%)", sep = ""), color = "black") +

ggtitle(paste(format(100 * variance_indiv, digits = 3), " % of variance; P = ", format(p.val2 , digits = 2), sep = "")) +

theme_classic() +

theme(text = element_text(family = "sans", size = 7))+

scale_fill_gradient2(low="#B5CBE2",high="#007808",mid="#FBFBFB",midpoint=1.5)

p15

后续几种代谢物的分析流程跟上述一致，只需在RDA分析时选择对应代谢物即可。

七、单一代谢物与微生物相关性分析

将上述TOP10微生物中的前5个做为待分析对象。并定义后续绘图theme。

tax<-c("Pseudomonadales","Xanthomonadales","Burkholderiales","Sphingomonadales","Sphingobacteriales")

tax2<-t(taxonomy)

tax2<-as.data.frame(tax2[, tax], row.names = row.names(tax2))

sampFiletax2 =cbind(sampFile,tax2)

theme1<-theme(plot.margin=unit(c(0.5,0.5,0.5,0.5),units=,"cm"),

strip.text = element_text(size=12),

axis.line = element_line(color = "black",size = 0.4),

panel.grid.minor = element_blank(),

panel.grid.major = element_blank(),

panel.background = element_blank(),

axis.text.y = element_text(color="black",size=10),

# axis.text.x = element_blank(),

# axis.ticks.x=element_blank(),

panel.spacing.x = unit(0,"cm"),

panel.border = element_blank(),

# legend.position = "none",

panel.spacing = unit(0,"lines")

)

绘制HCA_Pseudomonadales相关性分析结果，通过p21可以看到HCA的浓度与Pseudomonadales呈现正相关的线性关系。其余几种代谢物与微生物之间的单一线性关系分析流程与此一致。

p21 <-ggplot(sampFiletax2,aes(HCA,Pseudomonadales)) +

geom_point(aes(),pch=21,size =4, stroke = 0.1,color= "#BAB6B6",alpha=0.8,fill="#F4BD27")+

#scale_x_continuous(limits = c(1.3, 4.1), breaks = seq(1.3, 4.1, 0.5))+

geom_smooth(method = "lm", formula = NULL,linewidth=0.2,se=T,color="blue",linetype="dashed",aes(group=1))+

stat_cor(label.y = 100,aes(label = paste(..r.label.., ..p.label.., sep = "~`,`~"),group=1),color="black",method = "pearson",

label.x.npc = "left")+theme1

p21

八、总结

论文后续还对微生物组与PAL-SNP进行了关联分析，最终锁定了关键的M基因（Microbiome sharping gene）。后续将继续学习微生物组-SNP的关联分析流程，并且在作者的RDA分析之上添加更多的微生物组-代谢组相关分析方法。

微生物组代谢组联合分析学习笔记（一）

微生物组代谢组联合分析学习笔记（一）

一、输入文件准备

二、数据预处理

三、选择分析模型：RDA vs CCA

四、代谢物对微生物组的解释度分析

五、响应代谢物的微生物变量分析

六、单一代谢物对微生物组的解释度

七、单一代谢物与微生物相关性分析

八、总结

推荐阅读更多精彩内容