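When Monocle2 processes a single-cell sequencing expression matrix, it requires you to specify the distribution family of the data.
The options include the negative binomial, the Poisson (not recommended), and the log-Gaussian distribution.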
Later I came across the zero-inflated negative binomial (ZINB) model, which accounts for the large number of zeros in UMI experiments.
For details, see this article:
https://zhuanlan.zhihu.com/p/95299303
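
To make the difference concrete, here is a minimal sketch of how many zeros each model predicts. It assumes the common mean/size parameterization of the negative binomial (mean `mu`, size `theta`, variance `mu + mu^2 / theta`) and a ZINB that adds an extra point mass `pi` at zero. The function names and parameter values are made up for illustration and are not taken from any package.

```python
def nb_zero_prob(mu, theta):
    """P(X = 0) for a negative binomial with mean mu and size theta."""
    return (theta / (theta + mu)) ** theta


def zinb_zero_prob(mu, theta, pi):
    """P(X = 0) for a ZINB: extra zeros with probability pi, NB otherwise."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, theta)


# Illustrative values only: theta = 1 and pi = 0.2 are arbitrary choices.
for mu in (0.1, 1.0, 5.0):
    p_nb = nb_zero_prob(mu, theta=1.0)
    p_zinb = zinb_zero_prob(mu, theta=1.0, pi=0.2)
    print(f"mu={mu}: NB zeros={p_nb:.3f}, ZINB zeros={p_zinb:.3f}")
```

Note that a plain negative binomial already predicts many zeros when the mean is low, so a high zero fraction alone does not show that zero inflation is needed.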
Then I came across this Genome Biology paper and read the abstract:
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
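The authors argue that there is no zero inflation at all, and that current practice (log-CPM normalization plus feature selection by highly variable genes, HVGs) introduces false variability into dimension reduction. Instead they propose GLM-PCA for dimension reduction of non-normally distributed data, with feature selection based on deviance.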
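
The "no zero inflation" claim can be illustrated with a small simulation. This is only a sketch under assumptions of my own, not the paper's analysis: every cell is drawn from the same multinomial with the same number of UMIs, and the gene proportions are invented. Under multinomial sampling, a gene with relative abundance `p` in a cell with `n` UMIs is zero with probability `(1 - p)^n`, so the simulated zero fractions should match that value without any extra zero component.

```python
import random


def simulate_zero_fraction(props, umis_per_cell, n_cells, seed=0):
    """Fraction of cells with zero counts per gene under multinomial sampling."""
    rng = random.Random(seed)
    genes = range(len(props))
    zero_cells = [0] * len(props)
    for _ in range(n_cells):
        # Each UMI is assigned to a gene according to the proportions.
        drawn = set(rng.choices(genes, weights=props, k=umis_per_cell))
        for g in genes:
            if g not in drawn:
                zero_cells[g] += 1
    return [z / n_cells for z in zero_cells]


def expected_zero_fraction(props, umis_per_cell):
    """Theoretical zero probability (1 - p)^n for each gene."""
    return [(1.0 - p) ** umis_per_cell for p in props]


# Hypothetical proportions; the last entry lumps together all other genes.
props = [0.0005, 0.002, 0.01, 0.9875]
observed = simulate_zero_fraction(props, umis_per_cell=500, n_cells=2000)
expected = expected_zero_fraction(props, umis_per_cell=500)
for p, o, e in zip(props, observed, expected):
    print(f"p={p}: observed zeros={o:.3f}, expected={e:.3f}")
```

Low-abundance genes are zero in most cells here purely because of sampling.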
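Algorithms come in endless variations and I no longer want to pay much attention to them. I have tried all kinds of fancy methods before, and the overall results always pointed in the same direction.
I read this paper because I wanted to look into how data from UMI count experiments are actually distributed.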
In 2018, a Genome Biology paper proposed fitting UMI count data with a negative binomial model with independent dispersions:
Read counting and unique molecular identifier (UMI) counting are the principal gene expression quantification schemes used in single-cell RNA-sequencing (scRNA-seq) analysis. By using multiple scRNA-seq datasets, we reveal distinct distribution differences between these schemes and conclude that the negative binomial model is a good approximation for UMI counts, even in heterogeneous populations. We further propose a novel differential expression analysis algorithm based on a negative binomial model with independent dispersions in each group (NBID). Our results show that this properly controls the FDR and achieves better power for UMI counts when compared to other recently developed packages for scRNA-seq analysis.
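
To see what "independent dispersions in each group" means, here is a rough sketch that estimates a separate dispersion for each group of cells. It uses a simple method-of-moments estimate based on `variance = mu + phi * mu^2`, which is a stand-in of my own and not the estimator used by NBID. The counts are made up.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion phi, from variance = mu + phi * mu^2."""
    m = mean(counts)
    v = variance(counts)  # sample variance
    if m == 0:
        return 0.0
    # Truncate at 0: counts that are not overdispersed look Poisson-like.
    return max((v - m) / (m * m), 0.0)


# Invented UMI counts of one gene in two groups of cells.
group_a = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
group_b = [0, 0, 9, 0, 1, 12, 0, 0, 7, 1]

phi_a = moment_dispersion(group_a)
phi_b = moment_dispersion(group_b)
print(f"group A: mean={mean(group_a)}, dispersion={phi_a:.3f}")
print(f"group B: mean={mean(group_b)}, dispersion={phi_b:.3f}")
```

In this toy data the two groups differ in dispersion as well as in mean, which is the situation a single shared dispersion would describe poorly.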
Seurat's SCTransform function (built on the sctransform package) uses regularized negative binomial regression to perform "normalization" and "variance stabilization" on UMI count data (both terms need to be defined precisely).
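
As a starting point for those definitions, here is a minimal sketch of the Pearson residual under a negative binomial model. My understanding is that sctransform regresses each gene's counts on sequencing depth and reports residuals of this form as the normalized values. The sketch is not the sctransform implementation: it skips model fitting and regularization, and the coefficients `b0`, `b1` and the size parameter `theta` below are invented.

```python
import math


def expected_count(depth, b0, b1):
    """Expected count from a log-link model: log(mu) = b0 + b1 * log10(depth)."""
    return math.exp(b0 + b1 * math.log10(depth))


def nb_pearson_residual(count, mu, theta):
    """(observed - expected) / standard deviation under NB(mu, theta)."""
    return (count - mu) / math.sqrt(mu + mu * mu / theta)


# Hypothetical coefficients and size parameter for one gene.
b0, b1, theta = -4.0, 1.5, 2.0
for depth, count in [(1000, 0), (1000, 5), (10000, 5), (10000, 20)]:
    mu = expected_count(depth, b0, b1)
    z = nb_pearson_residual(count, mu, theta)
    print(f"depth={depth}, count={count}, mu={mu:.2f}, residual={z:.2f}")
```

In this reading, "normalization" means removing the part of the count explained by sequencing depth (subtracting `mu`), and "variance stabilization" means dividing by the model standard deviation so that residuals have comparable variance across expression levels.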