Microbial Networks (1): Building a Network with FastSpar

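Microbial community data from high-throughput sequencing usually have two important properties: sparsity and compositionality. Sparsity means the count matrix contains many zeros. Compositionality means the counts are normalized by each sample's total counts to give relative abundances. After this normalization, the relative abundances of the OTUs within a sample are no longer independent, because they must sum to 1, so a correlation matrix computed directly from them is not statistically valid. For a robust, unbiased analysis of sparse compositional data, the data must first be mapped from the simplex to real Euclidean space. Building on log ratios, Friedman and Alm developed the SparCC algorithm. A log-ratio transformation (such as the centered log-ratio) removes the unit-sum constraint, so the transformed values can be positive or negative. SparCC then estimates the linear Pearson correlations between OTUs from the variances of their log ratios. Under the assumption that strong correlations are few, the log-ratio variances approximate the variances of the underlying OTUs.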
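The relation behind the last two sentences can be written out as follows. This is a sketch in my own notation, not a quote from the paper: $x_i$ is the relative abundance of OTU $i$, $w_i$ its unobserved true abundance, $\omega_i^2 = \operatorname{Var}[\log w_i]$, $\rho_{ij}$ the correlation between $\log w_i$ and $\log w_j$, and $D$ the number of OTUs.

```latex
% The normalization constant cancels in a ratio, so x_i/x_j = w_i/w_j and
t_{ij} \equiv \operatorname{Var}\!\left[\log\frac{x_i}{x_j}\right]
       = \omega_i^2 + \omega_j^2 - 2\,\rho_{ij}\,\omega_i\,\omega_j

% Summing over all other OTUs; if strong correlations are few, the last term is small
\sum_{j \ne i} t_{ij}
  = (D-1)\,\omega_i^2 + \sum_{j \ne i} \omega_j^2
    - 2 \sum_{j \ne i} \rho_{ij}\,\omega_i\,\omega_j
  \approx (D-1)\,\omega_i^2 + \sum_{j \ne i} \omega_j^2

% With the \omega_i^2 solved from the approximate system, each correlation follows from
\rho_{ij} = \frac{\omega_i^2 + \omega_j^2 - t_{ij}}{2\,\omega_i\,\omega_j}
```

The $t_{ij}$ can be computed from the observed data, which is why the sparsity assumption is enough to recover the $\omega_i^2$ and then the $\rho_{ij}$.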

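In R, the sparcc and sparccboot functions of the SpiecEasi package compute the correlations and P values, from which the interaction network is built. However, microbial datasets often contain hundreds or thousands of OTUs (even after filtering for dominant taxa and so on), and the computation is still very slow.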

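> library(SpiecEasi)
> data(amgut1.filt)
> sparcc.amgut <- sparcc(amgut1.filt)
> # adjacency matrix: keep OTU pairs with |correlation| >= 0.6
> sparcc.graph <- abs(sparcc.amgut$Cor) >= 0.6
> diag(sparcc.graph) <- 0
> # 100 bootstrap replicates (samples in rows, as for sparcc); this takes a long time
> sparcc.boot <- sparccboot(amgut1.filt, R = 100)
> sparcc.p <- pval.sparccboot(sparcc.boot)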

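FastSpar is a C++ implementation of the SparCC algorithm. It is reported to be up to several thousand times faster than the original Python 2 version, uses less memory, and supports multithreading. On Ubuntu it can be installed directly with conda.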

Install and check that the command runs

(base) lg@LG:~$ conda install -c bioconda -c conda-forge fastspar

(base) lg1199@LG:~/testdata$  fastspar
Program: FastSpar (c++ implementation of SparCC)
Version 1.0.0
Contact: Stephen Watts (s.watts2@student.unimelb.edu.au)

Usage:
  fastspar [options] --otu_table <path> --correlation <path> --covariance <path>

  -c <path>, --otu_table <path>
                OTU input OTU count table
  -r <path>, --correlation <path>
                Correlation output table
  -a <path>, --covariance <path>
                Covariance output table

Options:
  -i <int>, --iterations <int>
                Number of interations to perform (default: 50)
  -x <int>, --exclusion_iterations <int>
                Number of exclusion interations to perform (default: 10)
  -e <float>, --threshold <float>
                Correlation strength exclusion threshold (default: 0.1)
  -t <int>, --threads <int>
                Number of threads (default: 1)
  -s <int>, --seed <int>
                Random number generator seed (default: 1)
  -y, --yes
                Assume yes for prompts (default: unset)

Compute cor and cov for the real data

# Structure of the input data
> head(SpaD1)
          LG1 LG2 LG3  LG5  LG6 LG7  LG8
OTU7641    20  16  22    4   15   3   12
OTU4872    10  15  11    4    6   4    5
OTU14004    7   6   4  102  126   2   36
OTU13136  895 244 106 2418 1491 157  289
OTU304    152  63 101  137 1549 106  161
OTU16660 3356 982 429 6183 7674 264 1043
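The table above is shown as an R data frame, while fastspar reads a tab-separated file with OTUs in rows and samples in columns. As I understand the FastSpar README, the header line is expected to start with `#OTU ID` (BIOM TSV without metadata); treat that as an assumption and check it against your version. A minimal sketch that builds a tiny table and sanity-checks its layout (the file name `demo_otu.tsv` is hypothetical, the counts are the first rows and columns shown above):

```shell
# Build a small example table in the assumed layout.
printf '#OTU ID\tLG1\tLG2\tLG3\n' >  demo_otu.tsv
printf 'OTU7641\t20\t16\t22\n'    >> demo_otu.tsv
printf 'OTU4872\t10\t15\t11\n'    >> demo_otu.tsv
printf 'OTU14004\t7\t6\t4\n'      >> demo_otu.tsv

# Check the header cell, the number of columns per row,
# and that every count is a non-negative integer.
check_result=$(awk -F'\t' '
    NR == 1 { if ($1 != "#OTU ID") bad = 1; ncol = NF; next }
    NF != ncol { bad = 1 }
    { for (i = 2; i <= NF; i++) if ($i !~ /^[0-9]+$/) bad = 1 }
    END { print (bad ? "invalid" : "ok") }
' demo_otu.tsv)
echo "$check_result"
```

Running the same awk check on the real file before starting a long job is cheap insurance against a malformed export from R.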

fastspar --threshold 0.6 --iterations 100 --otu_table /home/lg1199/testdata/SpaD1.tsv --correlation /home/lg1199/testdata/median_correlation.tsv --covariance /home/lg1199/testdata/median_covariance.tsv
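To go from the correlation table to network edges, the same cut-off idea as in the R example (|r| >= 0.6) can be applied on the command line. The sketch below assumes the output is a square, tab-separated OTU-by-OTU matrix with one header line; the file `demo_cor.tsv` and its values are made up for illustration and are not fastspar output.

```shell
# Toy 3x3 correlation matrix (hypothetical values).
printf '#OTU ID\tOTU_A\tOTU_B\tOTU_C\n' >  demo_cor.tsv
printf 'OTU_A\t1.0\t0.72\t-0.10\n'      >> demo_cor.tsv
printf 'OTU_B\t0.72\t1.0\t-0.65\n'      >> demo_cor.tsv
printf 'OTU_C\t-0.10\t-0.65\t1.0\n'     >> demo_cor.tsv

# Walk the strict upper triangle so each pair is reported once,
# and keep the pair when the absolute correlation reaches the cut-off.
edges=$(awk -F'\t' -v cutoff=0.6 '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    {
        # On line NR the diagonal sits in field NR, so start one field later.
        for (i = NR + 1; i <= NF; i++) {
            r = $i + 0
            a = (r < 0) ? -r : r
            if (a >= cutoff) print $1 "\t" name[i] "\t" $i
        }
    }
' demo_cor.tsv)
printf '%s\n' "$edges"
```

The result is a three-column edge list (OTU, OTU, correlation) that can be filtered further by P value once those are available.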

cor and cov for the bootstrap datasets

# Create directories for the bootstrap data, then compute cor and cov for each dataset
(base) lg1199@LG:~/testdata$ mkdir bootstrap_counts
(base) lg1199@LG:~/testdata$ fastspar_bootstrap --otu_table /home/lg1199/testdata/SpaD1.tsv --number 1000 --prefix bootstrap_counts/SpaD1

(base) lg1199@LG:~/testdata$ mkdir bootstrap_correlation
(base) lg1199@LG:~/testdata$ parallel fastspar --otu_table {} --correlation bootstrap_correlation/cor_{/} --covariance bootstrap_correlation/cov_{/} -i 5 ::: bootstrap_counts/*
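To see what the parallel call runs, the loop below prints the equivalent commands without executing anything. In GNU parallel, `{}` is the input path and `{/}` is its basename. The directory `demo_counts` and its two empty files are hypothetical stand-ins, and the `SpaD1_<n>.tsv` naming is my assumption about what fastspar_bootstrap writes for the prefix used above.

```shell
# Stand-in files so the loop has something to iterate over.
mkdir -p demo_counts
: > demo_counts/SpaD1_0.tsv
: > demo_counts/SpaD1_1.tsv

# Print one fastspar command per bootstrap table (dry run only).
cmds=$(for f in demo_counts/*; do
    base=$(basename "$f")   # what {/} expands to
    echo "fastspar --otu_table $f --correlation bootstrap_correlation/cor_$base --covariance bootstrap_correlation/cov_$base -i 5"
done)
printf '%s\n' "$cmds"
```

The printed names show that the correlation tables end up as `bootstrap_correlation/cor_SpaD1_<n>.tsv` and the covariance tables as `bootstrap_correlation/cov_SpaD1_<n>.tsv`, which matters when a later step asks for a file prefix.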

Bootstrap P values

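The way these P values are computed shows that this is a null-model-like approach, so they are not the same as the P values of a classical hypothesis test (see "Calculating microbial beta deviation: null models", https://www.jianshu.com/p/d803257d405c).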

fastspar_pvalues --otu_table /home/lg1199/testdata/SpaD1.tsv --correlation median_correlation.tsv --prefix bootstrap_correlation/cor_SpaD1_ --permutations 1000 --outfile pvalues
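The null-model idea can be illustrated for a single OTU pair: count how often a correlation from the resampled data is at least as extreme as the observed one, and divide by the number of resamples. This is a simplified sketch with made-up numbers; the exact formula inside fastspar_pvalues may differ in detail.

```shell
# Observed correlation for one OTU pair and five resampled correlations (hypothetical values).
observed=0.5
boot_values="0.1 -0.6 0.3 0.55 -0.2"

# Fraction of resampled correlations with |r| >= |observed|.
pvalue=$(printf '%s\n' $boot_values | awk -v obs="$observed" '
    function abs(x) { return (x < 0) ? -x : x }
    { n++; if (abs($1) >= abs(obs)) extreme++ }
    END { printf "%.2f", extreme / n }
')
echo "p = $pvalue"
```

Here two of the five resampled values (-0.6 and 0.55) are at least as extreme as 0.5, so the script prints `p = 0.40`. With 1000 resamples the smallest non-zero value this simple ratio can take is 0.001.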

Screenshots of a successful run:

Figure 1. Computing cor and cov

Figure 2. Computing P values

Figure 3. Exporting the Cor and P value matrices

Refs:
https://github.com/scwatts/fastspar
https://academic.oup.com/bioinformatics/article/35/6/1064/5086389?login=false