利用 Dirichlet-multinomial regression 计算不同条件下亚群丰度变化

image.png

方法来源于上面这篇文章，不得不说，这篇文章运用了非常多复杂的方法去阐述关注的科学问题，真不愧是出自Broad institute实验室的。我这里暂时只讲下文章中一种比较新颖的比较不同条件下亚群丰度变化的方法。

首先我们先了解下Dirichlet-multinomial regression。
让我们从数学层面开始：
假设从正常组织取了sample i，正常组织本身包含了p种cell type，假设各种cell type出现的概率为

Cell probability

, sample i中各种cell type出现的数目为

Number of cell type

sample i中共有N个cell，全部细胞的总和就是，

Total number of cells

那么汇总下，这个sample的概率分布就为

其中

是 sample i 对应的多项分布的参数, 满足

对于多项分布, 其均值和方差均可由该参数决定, 即设 Xij 是相应位置的随机变量, 则有

引入Dirichlet概率分布

对于同一个tissue不同的sample，不同sample之间的成分参数

可能会相差很大，具有‘超散布性’ （overdispersion），也就是说观测到的不同样本中成分参数的方差会显著大于多项分布模型下给出的方差。如何在建模中考虑这种超散布性，并且不给模型增加太多参数呢？我们假设不同样本的成分参数来自同一随机变量的不同实现，那么我们可以使用Dirichlet分布来model 成分参数分布。

其中

是Dirichlet分布参数，满足

结合前面的multinomial distribution，我们可以写出 DM 模型的概率分布为

DM 模型的另一种等价的参数化方法是令

则概率分布可写为

这种参数化方式的好处是参数的含义更易解释, 因为有

因此，对于包含多个cell type的tissue sample data，可以通过fit DM 模型的方式对两个sample中各个cell type abundance进行统计检验。
上面发表在Cell上的这篇文章就是用到的就是DirichletReg这个包，来检验正常人组织 VS 病人组织non-inflamed sample间哪些cell type abundance存在明显差异，以及检验正常人组织 VS 病人组织inflamed sample间哪些cell type abundance存在明显差异。

这里是示例代码

# source code downloaded from:
# https://github.com/cssmillie/ulcerative_colitis
# there is a 'ulcerative_colitis-master' folder which contains a number of r code scripts

## load required libraries (analysis.r code includes all the libraries for running all analyses,
## but as here only shows the Cellular Composition difference test, only libraries listed below need to be loaded.
library(Seurat)
library(RColorBrewer) #for brewer.pal
library(Matrix) #for Matrix
library(DirichletReg)
library(data.table)
library(tidyverse)
library(cowplot)

## this function is extracted from analysis.r 
dirichlet_regression = function(counts, covariates, formula){  
  # Dirichlet multinomial regression to detect changes in cell frequencies
  # formula is not quoted, example: counts ~ condition
  # counts is a [samples x cell types] matrix
  # covariates holds additional data to use in the regression
  #
  # Example:
  # counts = do.call(cbind, tapply(seur@data.info$orig.ident, seur@ident, table))
  # covariates = data.frame(condition=gsub('[12].*', '', rownames(counts)))
  # res = dirichlet_regression(counts, covariates, counts ~ condition)
  
  #ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals

  # Calculate regression
  counts = as.data.frame(counts)
  counts$counts = DR_data(counts)
  data = cbind(counts, covariates)
  fit = DirichReg(counts ~ condition, data)
  
  # Get p-values
  u = summary(fit)
  #compared with healthy condition, 15 vars. noninflame and inflame, 30pvalues
  pvals = u$coef.mat[grep('Intercept', rownames(u$coef.mat), invert=T), 4] 
  v = names(pvals)
  pvals = matrix(pvals, ncol=length(u$varnames))
  rownames(pvals) = gsub('condition', '', v[1:nrow(pvals)])
  colnames(pvals) = u$varnames
  fit$pvals = pvals
  
  fit
}



## not all **.r in analysis.r need to be 'source'.
## only these three below are required for cellular composition difference test.
source('mtx.r') #for sparse_cbind
source('plot.r') #for matrix_barplot
source('colors.r') #for set.colors

## Load metadata for discovery and validation cohorts
## data downloaded from: 
## https://portals.broadinstitute.org/single_cell/study/SCP259
meta = read.table('all.meta2.txt', sep='\t', header=T, row.names=1, stringsAsFactors=F)

# Read a list of cell subsets, including the group that each one belongs to
# Groups include: Epithelial, Endothelial, Fibroblasts, Glia, Myeloid, B, and T cells
cell_subsets = read.table('cell_subsets.txt', sep='\t', header=F, stringsAsFactors=F)

# load seurat objects
# data downloaded from:
# https://www.dropbox.com/sh/dn4gwdww8pmfebf/AACXYu8rda5LoLwuCZ8aZXfma?dl=0
   
epi.seur = readRDS('train.Epi.seur.rds')
fib.seur = readRDS('train.Fib.seur.rds')
imm.seur = readRDS('train.Imm.seur.rds')
  
# set counts matrices
epi.counts = epi.seur@assays[['RNA']]@counts
fib.counts = fib.seur@assays[['RNA']]@counts
imm.counts = imm.seur@assays[['RNA']]@counts
  

# Use dirichlet-multinomial regression to find significant changes in cell frequencies during disease
# ---------------------------------------------------------------------------------------------------

# Count each cell subset in every sample
epi.freq = as.matrix(as.data.frame.matrix(table(epi.seur@meta.data$Sample, epi.seur@meta.data$Cluster)))
fib.freq = as.matrix(as.data.frame.matrix(table(fib.seur@meta.data$Sample, fib.seur@meta.data$Cluster)))
imm.freq = as.matrix(as.data.frame.matrix(table(imm.seur@meta.data$Sample, imm.seur@meta.data$Cluster)))

# Combine counts into a single matrix
all.freq = sparse_cbind(list(epi.freq, fib.freq, imm.freq))

# For the validation cohort, we need to combine the replicate samples (because they are not independent)
# To construct a list of replicates, we remove "1", "2", "a", and "b" from the sample IDs
reps = gsub('[12]*[ab]*$', '', rownames(all.freq))
temp = as.matrix(data.frame(aggregate(as.matrix(all.freq), list(reps), sum), row.names=1))
colnames(temp) = colnames(all.freq)
all.freq = temp[,colSums(temp) > 0]

# Split matrix into "epithelial" and "lamina propria" cell subsets and samples
ep.ident = levels(epi.seur@active.ident)
lp.ident = c(levels(fib.seur@active.ident), levels(imm.seur@active.ident))
ep.freq = all.freq[grep('Epi', rownames(all.freq)), ep.ident]
lp.freq = all.freq[grep('LP', rownames(all.freq)), lp.ident]

# For the dirichlet-multinomial regression, we need to know the disease state for each sample
# We can get this from the metadata table as follows:
sample2health = data.frame(unique(data.frame(sample=gsub('[12]*[ab]*$', '', meta[,'Sample']), health=meta[,'Health'])), row.names=1)
ep.cov = data.frame(condition=factor(sample2health[rownames(ep.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(ep.freq))
lp.cov = data.frame(condition=factor(sample2health[rownames(lp.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(lp.freq))

# Calculate significant changes using dirichlet multinomial regression
# This returns a matrix of p-values for each cell type / disease state
# 3 condition
ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals
colnames(ep.pvals) = colnames(ep.freq)
lp.pvals = dirichlet_regression(counts=lp.freq, covariates=lp.cov, formula=counts ~ condition)$pvals
colnames(lp.pvals) = colnames(lp.freq)

# Plot epithelial cell proportions
ep.pct = 100*ep.freq/rowSums(ep.freq)
p1 = matrix_barplot(ep.pct, group_by=ep.cov$condition, pvals=ep.pvals, colors=set.colors)
save_plot(p1, file='train.Fig2A.epi_freqs.pdf', nrow=1, ncol=2.5)

# Plot lamina propria cell proportions
lp.pct = 100*lp.freq/rowSums(lp.freq)
p2 = matrix_barplot(lp.pct, group_by=lp.cov$condition, pvals=lp.pvals, colors=set.colors)
save_plot(p2, file='train.Fig2A.lp_freqs.pdf', nrow=1, ncol=2.5)

用自己的数据计算下

My scRNA-seq data

总结下，这篇文章运用了非常多计算方法，当然自己原创的部分也占据了很大部分，有必要好好学习下。

参考1：https://zhuanlan.zhihu.com/p/341941329
参考2：https://www.stat-center.pku.edu.cn/docs/20190226150810864721.pdf
原文：https://pubmed.ncbi.nlm.nih.gov/31348891/