single cell
clustering
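Key Points
- Choosing a clustering strategy for scRNA-seq data analysis
- The technical, biological, and computational challenges of clustering
- Interpreting the biological meaning of clusters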
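Foreword
- Flow cytometry is also a single-cell technology. The difference is that flow cytometry identifies cell populations by their cell-surface proteins, whereas scRNA-seq quantifies the expression profile of each individual cell and identifies cell populations from their top expressed genes.
- Why cluster? Clustering on expression profiles is an unsupervised, data-driven, unbiased approach. It partitions cells into cell types and is very helpful for studying cell heterogeneity, development, and evolution.
- Many clustering methods implicitly assume that discrete clusters exist in the data. For some developmental lineages, however, trajectories may need to be considered, because the clusters are related to one another in time.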
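Main text of the paper
Clustering strategies
Characteristics of the scRNA-seq expression matrix:
- High-dimensional (expression values for more than ten thousand genes)
- Sparse (many expression values are 0 or close to 0)
Computing distances for clustering:
- Using all features (that is, all genes) runs into the 'curse of dimensionality': the pairwise distances between cells become similar to one another, so the differences between them tend to be small and carry little information
- Feature selection and dimensionality reduction: build the feature space from a subset of genes, or reduce the dimensions with a method such as PCA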
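A small simulation can illustrate how pairwise distances lose contrast as the number of dimensions grows. This is only a sketch and not part of the paper: it uses uniform random points rather than real expression data, and `relative_contrast` is a hypothetical helper name introduced here.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return (max(dists) - min(dists)) / min(dists)

# Same number of points, very different number of features.
low = relative_contrast(50, 2)
high = relative_contrast(50, 2000)
print(f"2 dims: {low:.2f}   2000 dims: {high:.2f}")
```

In two dimensions the farthest pair is many times more distant than the closest pair; with thousands of features the two are nearly the same, which is why nearest-neighbour relations become unreliable without feature selection or dimensionality reduction.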
Possible measures include Euclidean distance, cosine similarity, Pearson's correlation, and Spearman's correlation. The latter three consider only the relative differences between values, which makes them more robust to differences in library or cell size.
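The robustness to library size can be checked on a toy example. The sketch below assumes two hypothetical cells with the same relative expression profile, one sequenced three times deeper than the other; the counts and function names are made up for illustration, and Spearman's correlation is left out for brevity.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson(a, b):
    # Pearson's correlation is the cosine similarity of the mean-centred vectors.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - ma for x in a], [y - mb for y in b])

cell_a = [10, 0, 5, 1, 20]          # hypothetical counts for five genes
cell_b = [3 * x for x in cell_a]    # same profile, three times the library size

print("Euclidean distance:", euclidean(cell_a, cell_b))
print("Cosine similarity: ", cosine_similarity(cell_a, cell_b))
print("Pearson correlation:", pearson(cell_a, cell_b))
```

The Euclidean distance is large even though the two cells differ only in depth, while cosine similarity and Pearson's correlation are both 1 up to floating-point rounding.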
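k-means is a commonly used clustering method whose computational complexity grows linearly with the number of points. However, ① k-means is usually run as a greedy algorithm that can get stuck in a local optimum, so it has to be repeated several times from different initial conditions or, as SC3 does, many runs have to be combined into a consensus; ② it is biased towards identifying equal-sized clusters, which causes rare cell types to be missed.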
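The restart strategy can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm on simulated 2-D points, written for illustration only; it is not SC3's consensus procedure, and the function names, the blob layout, and the default of ten restarts are all assumptions.

```python
import math
import random

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points]

def kmeans_once(points, k, rng, n_iter=50):
    """One run of Lloyd's algorithm from random initial centres."""
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centres)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centres)
    inertia = sum(math.dist(p, centres[l]) ** 2 for p, l in zip(points, labels))
    return inertia, labels

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Repeat k-means and keep the run with the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

# Three simulated, well-separated groups of 30 points each.
data_rng = random.Random(1)
points = [[cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1)]
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
best_inertia, best_labels = kmeans_restarts(points, k=3)
print(f"best within-cluster sum of squares: {best_inertia:.1f}")
```

Keeping the run with the lowest within-cluster sum of squares guards against a single unlucky initialisation, but it does nothing about the bias towards equal-sized clusters.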
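Another commonly used method is hierarchical clustering, which can be top-down (divisive) or bottom-up (agglomerative). It is time and memory consuming, however: both requirements grow quadratically with the number of data points.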
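The quadratic growth is easy to quantify for the pairwise distance matrix alone. The following back-of-the-envelope sketch assumes one 8-byte float per cell pair and ignores all other overheads; `condensed_distance_bytes` is a hypothetical helper name.

```python
def condensed_distance_bytes(n_cells, bytes_per_value=8):
    """Memory for one distance per unordered cell pair: n * (n - 1) / 2 values."""
    return n_cells * (n_cells - 1) // 2 * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gb = condensed_distance_bytes(n) / 1e9
    print(f"{n:>7} cells: {gb:.3f} GB")
```

Under these assumptions, 100,000 cells already need about 40 GB just to store the distances, a hundred times more than 10,000 cells.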
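A third commonly used family is community-detection-based algorithms, also called graph-based algorithms. They first build a k-nearest-neighbours graph, and the choice of k strongly affects the size and number of the resulting clusters. Most graph-based clustering methods return only a single solution, and they do not require the number of clusters to be specified.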
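How k shapes the graph can be seen on a toy example. The sketch below builds an undirected k-nearest-neighbours graph on ten hypothetical 1-D points forming two groups, and counts connected components as a crude stand-in for community detection; real tools apply algorithms such as Louvain to the graph, which are not implemented here.

```python
import math

def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as a dict: node -> set of neighbours."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def n_components(adj):
    """Number of connected components, found by depth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node] - seen)
    return count

# Two groups of five points, far apart on a line (made-up coordinates).
points = [[float(x)] for x in (0, 1, 2, 3, 4, 100, 101, 102, 103, 104)]
for k in (2, 5):
    print(f"k={k}: {n_components(knn_graph(points, k))} component(s)")
```

With k=2 every neighbour lies inside the point's own group and the graph splits into two components; with k=5 each point has only four same-group candidates, so edges are forced across the gap and the graph becomes a single component.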
Name | Year | Method type | Strengths | Limitations |
---|---|---|---|---|
scanpy [4] | 2018 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
Seurat (latest) [3] | 2016 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
PhenoGraph [32] | 2015 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
SC3 [22] | 2017 | PCA + k-means | High accuracy through consensus, provides estimation of k | High complexity, not scalable |
SIMLR [24] | 2017 | Data-driven dimensionality reduction + k-means | Concurrent training of the distance metric improves sensitivity in noisy data sets | Adjusting the distance metric to make cells fit the clusters may artificially inflate quality measures |
CIDR [25] | 2017 | PCA + hierarchical | Implicitly imputes dropouts when calculating distances | |
GiniClust [75] | 2016 | DBSCAN | Sensitive to rare cell types | Not effective for the detection of large clusters |
pcaReduce [27] | 2016 | PCA + k-means + hierarchical | Provides hierarchy of solutions | Very stochastic, does not provide a stable result |
Tasic et al. [28] | 2016 | PCA + hierarchical | Cross validation used to perform fuzzy clustering | High complexity, no software package available |
TSCAN [41] | 2016 | PCA + Gaussian mixture model | Combines clustering and pseudotime analysis | Assumes clusters follow multivariate normal distribution |
mpath [45] | 2016 | Hierarchical | Combines clustering and pseudotime analysis | Uses empirically defined thresholds and a priori knowledge |
BackSPIN [26] | 2015 | Biclustering (hierarchical) | Multiple rounds of feature selection improve clustering resolution | Tends to over-partition the data |
RaceID [23], RaceID2 [115], RaceID3 | 2015 | k-means | Detects rare cell types, provides estimation of k | Performs poorly when there are no rare cell types |
SINCERA [5] | 2015 | Hierarchical | Method is intuitively easy to understand | Simple hierarchical clustering is used, may not be appropriate for very noisy data |
SNN-Cliq [80] | 2015 | Graph-based | Provides estimation of k | High complexity, not scalable |
- DBSCAN, density-based spatial clustering of applications with noise; PCA, principal component analysis; scRNA-seq, single-cell RNA sequencing.
Discrete versus continuous cell grouping
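Most partitioning clustering algorithms do not check whether biologically meaningful groups actually exist, so they may be a poor fit when the data contain no discrete groups. This is especially true when cells lie along a continuum of states, as in differentiation; in that case a one-dimensional manifold ('pseudotime') is commonly used to order the cells.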
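As a very rough illustration of ordering cells along a one-dimensional axis, the sketch below projects cells onto their first principal component, found by power iteration. It assumes the continuum is a straight line in expression space, which real pseudotime tools do not, and the three-gene toy data and function names are hypothetical.

```python
import math

def first_pc_scores(rows, n_iter=100):
    """Project mean-centred rows onto their first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(n_iter):
        # One power-iteration step: v <- X^T (X v), then renormalise.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(s * row[j] for s, row in zip(scores, centred)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centred]

# Hypothetical cells sampled out of order along a differentiation axis:
# gene 1 rises, gene 2 rises twice as fast, gene 3 falls.
true_time = [3, 0, 4, 1, 2]
cells = [[t, 2 * t, 10 - t] for t in true_time]

scores = first_pc_scores(cells)
order = sorted(range(len(cells)), key=lambda i: scores[i])
print("cells ordered by pseudotime:", order)
print("their true times:           ", [true_time[i] for i in order])
```

The sign of a principal component is arbitrary, so in general such an ordering recovers the trajectory only up to its direction; deciding which end is the start requires prior knowledge.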
Technical challenges
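- More dropouts. Possible causes: the gene is not expressed, the sequencing depth is low, or the transcript was not captured during library preparation. Several statistical methods now exist to impute zeros.
- Estimating technical noise: use exogenous spike-in RNA as a positive control.
- Batch effects: the best way to avoid them is a balanced experimental design.
- RNA degradation during library preparation also needs to be considered.
- Doublets (droplets containing two cells).
- Highly expressed genes, such as ribosomal genes, can also affect clustering.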
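To gauge how much highly expressed ribosomal genes contribute to a cell, one simple check is the fraction of its counts that comes from them. The sketch below assumes human gene symbols, where ribosomal protein genes start with RPS or RPL; the prefix match is crude (it would also catch kinases such as RPS6KA1), and the counts are invented.

```python
def ribosomal_fraction(counts):
    """Fraction of a cell's counts from genes whose symbol starts with RPS or RPL."""
    total = sum(counts.values())
    ribo = sum(n for gene, n in counts.items() if gene.startswith(("RPS", "RPL")))
    return ribo / total if total else 0.0

# Hypothetical counts for one cell: gene symbol -> number of reads or UMIs.
cell = {"RPS6": 40, "RPL13": 60, "CD3E": 5, "ACTB": 95}
print(f"ribosomal fraction: {ribosomal_fraction(cell):.2f}")
```

A cell or data set with a very high fraction may warrant excluding these genes from the feature set before computing distances; what counts as too high is a judgment call that this sketch does not settle.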
Biological challenges
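Cell cycle: scLVM and cyclone can be used to handle this effect.
Rare cell type identification: a divide-and-conquer strategy can help, but whether a large cluster should be split further is itself an open question.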
Computational challenges
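High dimensionality
Linear dimensionality reduction: PCA
Non-linear dimensionality reduction: t-SNE and UMAP
Parameter choice, for example k in k-means and the number of nearest neighbours k in graph-based algorithms
How to validate that a method works, and how to establish gold-standard data sets
- candidates are tissues that are very well studied and understood, or cells taken from the earliest stages of embryonic development
- many of the suitable data sets are quite small, making it difficult to test methods at the kinds of scale that are relevant for current experiments
Experimental approaches, in particular spatial methods such as FISH and RNAscope, can be used for validation.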
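For the parameter-choice problem noted above, one common heuristic (not taken from the paper) is to compare candidate clusterings by their mean silhouette width and prefer the higher score. The sketch below assumes the labelings have already been produced by some clustering method; the points and labels are made up.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
                if lj == li and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)                      # mean distance within own cluster
        b = min(                                       # mean distance to nearest other cluster
            sum(math.dist(p, q) for q, lj in zip(points, labels) if lj == other)
            / labels.count(other)
            for other in set(labels) if other != li
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 1, 2, 2, 2]   # over-splits the first group

print("k=2:", round(mean_silhouette(points, two_clusters), 3))
print("k=3:", round(mean_silhouette(points, three_clusters), 3))
```

The two-cluster labeling scores higher here because splitting a compact group lowers the silhouette of its members. On real data the score inherits every weakness of the underlying distance measure, so it is a guide rather than a validation.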
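Biological interpretation and annotation
How to label the resulting clusters is a hard problem. Much as flow cytometry relies on cell-surface proteins, scRNA-seq takes the genes that are highly expressed in a cluster as its marker genes and labels the cluster by checking them against the literature and databases.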
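A bare-bones version of marker selection ranks genes by how much higher their mean expression is inside a cluster than outside it. The sketch below is only that difference of means, with no statistical test, and the expression values are hypothetical; it assumes both the cluster and its complement contain at least one cell.

```python
def top_marker(expression, labels, cluster):
    """Gene with the largest mean expression difference: cluster vs. all other cells."""
    def mean(values):
        return sum(values) / len(values)

    best_gene, best_diff = None, float("-inf")
    for gene, values in expression.items():
        inside = [v for v, l in zip(values, labels) if l == cluster]
        outside = [v for v, l in zip(values, labels) if l != cluster]
        diff = mean(inside) - mean(outside)
        if diff > best_diff:
            best_gene, best_diff = gene, diff
    return best_gene

expression = {   # hypothetical log-expression values, one entry per cell
    "CD3E": [5.0, 4.5, 0.1, 0.0],
    "MS4A1": [0.0, 0.2, 4.8, 5.1],
    "ACTB": [6.0, 6.1, 5.9, 6.2],
}
labels = [0, 0, 1, 1]   # cluster assignment of the four cells

for cluster in (0, 1):
    print(f"cluster {cluster}: top marker {top_marker(expression, labels, cluster)}")
```

Comparing against the other cells matters: ACTB has the highest expression in both clusters, yet it is useless as a marker because it does not distinguish them. CD3E and MS4A1 are established T-cell and B-cell markers respectively, which is the kind of lookup the labeling step depends on.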
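GO enrichment analysis can also help with labeling; what is urgently needed here is a Cell Ontology database.
How should new scRNA-seq data be integrated with existing data? Batch effects have to be considered here.
Integration can be done in two ways: ① merge the expression matrices first and then cluster the merged data; ② provide a BLAST-like function that, given the expression profile of a cell, finds its nearest neighbours in the existing data.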
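The BLAST-like lookup in option ② can be sketched as a nearest-neighbour search against labelled reference profiles. The example below uses cosine similarity and two invented reference profiles over four unnamed genes; it performs no batch correction, which a real query against another data set would need.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_reference(query, reference):
    """Return (label, similarity) of the reference profile closest to the query."""
    label = max(reference, key=lambda name: cosine_similarity(query, reference[name]))
    return label, cosine_similarity(query, reference[label])

reference = {   # hypothetical mean expression profiles of annotated cell types
    "T cell": [9.0, 8.0, 0.0, 1.0],
    "B cell": [0.0, 1.0, 9.0, 7.0],
}
query = [18.0, 15.0, 1.0, 0.0]   # a new cell, sequenced about twice as deeply

label, similarity = nearest_reference(query, reference)
print(f"closest reference: {label} (cosine similarity {similarity:.3f})")
```

Cosine similarity is used so that the deeper sequencing of the query cell does not dominate the match.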
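Beyond the RNA level there are data from other levels, that is, multi-omics data, which can improve cell type identification. Experimental spatial staining methods can additionally help to check how good a clustering is.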