前情回顾
Seurat 4.0 ||单细胞多模态数据整合算法WNN
Seurat 4.0 ||单细胞数据分析工具箱有更新
得益于单细胞多模态技术的发展,允许我们在单细胞水平从不同侧面考察细胞状态,如CITE-seq技术可以同时对单细胞转录组和膜蛋白进行定量。单细胞多模态数据分析面临的挑战和Seurat给出的解决方案,在预印本文章Integrated analysis of multimodal single-cell data中进行了介绍(中译版已由单细胞天地于近日推送)。算法落实到实现层面,我们来学习一下WNN的几篇教程,本文介绍了用于分析多模态单细胞数据集的加权最近邻(WNN)工作流程。该流程由三个步骤组成:
- 每个模态独立的预处理和降维
- 学习细胞特定的模态“权重”,并构建一个集成了这些模态的WNN图
- WNN图的下游分析(如可视化、聚类等)
我们使用来自(Stuart, Butler等人,Cell 2019)的CITE-seq数据集,该数据集包含30,672个scRNA-seq 数据及25个抗体数据。研究对象包括RNA和抗体衍生标签(ADT)两种检测方法。
要运行本教程,请安装Seurat v4,在github页面上有beta版本。
remotes::install_github("satijalab/seurat", ref = "release/4.0.0")
如下载入我们的老朋友:
library(Seurat)
library(SeuratData)
library(cowplot)
library(dplyr)
下载示例数据集
InstallData("bmcite")
bm <- LoadData(ds = "bmcite")
照例我们看看数据是怎么样的:
bm
An object of class Seurat
17034 features across 30672 samples within 2 assays
Active assay: RNA (17009 features, 2000 variable features)
1 other assay present: ADT
1 dimensional reduction calculated: spca
我们看到这个Seurat对象有两个 assays每个 assays 都有相应的表达谱。
dim(bm@assays$RNA)
[1] 17009 30672
bm@assays$ADT@counts[1:4,1:4]
4 x 4 sparse Matrix of class "dgCMatrix"
a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1 a_AAACCTGCAAACCTAC-1
CD11a 110 541 120 645
CD11c 51 12 10 42
CD123 14 5 1 5
CD127-IL7Ra 20 187 43 115
RNA的assay我们是很熟悉了,ADT数据也是一个表达谱,包含的抗体:
rownames(bm@assays$ADT)
[1] "CD11a" "CD11c" "CD123" "CD127-IL7Ra" "CD14" "CD16" "CD161" "CD19" "CD197-CCR7" "CD25" "CD27"
[12] "CD278-ICOS" "CD28" "CD3" "CD34" "CD38" "CD4" "CD45RA" "CD45RO" "CD56" "CD57" "CD69"
[23] "CD79b" "CD8a" "HLA.DR"
我们首先分别对这两种数据进行预处理和降维。我们使用标准的标准化,但您也可以使用SCTransform或任何替代方法。
DefaultAssay(bm) <- 'RNA'
bm <- NormalizeData(bm) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(bm) <- 'ADT'
# we will use all ADT features for dimensional reduction
# we set a dimensional reduction name to avoid overwriting the
VariableFeatures(bm) <- rownames(bm[["ADT"]])
bm <- NormalizeData(bm, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(reduction.name = 'apca')
Normalizing across cells
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s
Centering and scaling data matrix
|================================================================================================================================================================| 100%
Warning in irlba(A = t(x = object), nv = npcs, ...) :
You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = npcs, ...) :
did not converge--results might be invalid!; try increasing work or maxit
PC_ 1
Positive: CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO
CD45RA, CD197-CCR7
Negative: HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b
CD16, CD56
PC_ 2
Positive: CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra
CD38, CD161
Negative: CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25
HLA.DR, CD34
PC_ 3
Positive: CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra
CD27, CD11c
Negative: CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO
CD14, CD278-ICOS
PC_ 4
Positive: CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57
CD127-IL7Ra, HLA.DR
Negative: CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a
CD34, CD123
PC_ 5
Positive: CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7
CD27, CD3
Negative: CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a
CD127-IL7Ra, HLA.DR
Warning messages:
1: Requested number is larger than the number of available items (25). Setting to 25.
2: Requested number is larger than the number of available items (25). Setting to 25.
3: Requested number is larger than the number of available items (25). Setting to 25.
4: Requested number is larger than the number of available items (25). Setting to 25.
5: Requested number is larger than the number of available items (25). Setting to 25.
6: Cannot add objects with duplicate keys (offending key: PC_), setting key to 'apca_'
随着单细胞多模态技术的发展,我们要习惯一个对象有多个assay的数据结构。数据结构就像一处盖好的大楼,房间都在那了,就是往数据里面填我们计算好的数据,比如这里的apca。
head(bm@reductions$apca@cell.embeddings)
apca_1 apca_2 apca_3 apca_4 apca_5 apca_6 apca_7 apca_8 apca_9 apca_10 apca_11 apca_12 apca_13
a_AAACCTGAGCTTATCG-1 -1.546214 -1.0560783 -0.3193425 0.5561288 1.1332821 -3.46190907 -0.6363978 -0.69711363 -0.15623767 -0.1291178 0.43826804 0.4848001 0.02257897
a_AAACCTGAGGTGGGTT-1 2.698538 1.5314222 2.2152152 3.8489785 -0.2302644 0.41040618 1.7093026 -2.08814571 1.93091407 1.2356946 0.21717667 -0.6694691 -1.46231202
a_AAACCTGAGTACATGA-1 3.287854 0.2591767 0.1191195 -0.7199622 1.7688191 0.01789973 -1.2586785 0.08800729 0.42980044 0.9671439 -0.73346613 -0.3946358 -1.28168949
a_AAACCTGCAAACCTAC-1 3.100410 2.3937004 -0.9758685 1.0520679 0.0893280 0.01760703 -0.4626806 0.44875816 0.70985188 0.6547350 -0.12642028 0.2164491 1.06710170
a_AAACCTGCAAGGTGTG-1 -3.707086 2.5657331 -0.3487033 -0.6951109 -0.7601989 0.20388700 -0.7173617 -0.12882660 -0.08661708 0.3569946 -0.10704170 -0.5074276 -0.07922931
a_AAACCTGCACGGTAGA-1 -2.627561 -3.7290142 -2.2912037 0.5042828 -0.3032089 0.12050431 0.3047013 -0.01124192 0.74586469 -0.4926128 0.04005597 0.3078714 0.62619888
apca_14 apca_15 apca_16 apca_17 apca_18 apca_19 apca_20 apca_21 apca_22 apca_23 apca_24
a_AAACCTGAGCTTATCG-1 0.03922873 0.284971458 -0.05987775 -0.63780150 -0.3893929 0.72669647 0.4473322 0.30682227 -0.56594325 -0.14519199 0.26113206
a_AAACCTGAGGTGGGTT-1 -1.26101465 0.727477850 -1.33392808 0.06360178 -0.2847066 0.47726469 -0.2159961 -0.75962725 0.31336909 0.27562030 -0.46558363
a_AAACCTGAGTACATGA-1 -0.09693877 -0.060252586 0.21063702 -0.31319328 0.2271467 0.17773498 0.3805679 0.28309195 0.15551460 -0.01404676 -0.12147257
a_AAACCTGCAAACCTAC-1 0.09764741 -0.001231692 0.34339056 -0.05304418 0.1969659 0.05418865 -0.2202433 -0.16040501 -0.12525310 0.05894026 -0.03718946
a_AAACCTGCAAGGTGTG-1 0.24172883 0.040471346 0.24044018 0.34446924 -0.1282743 -0.11257312 -0.2512336 0.18092572 0.23236869 -0.10883569 0.05730309
a_AAACCTGCACGGTAGA-1 -0.18285025 0.911541345 0.20308131 -0.81552469 -0.5860654 0.25639653 0.3117227 -0.06062992 -0.07510367 -0.29029817 -0.01214964
对于每个细胞,我们根据RNA和蛋白质相似度的加权组合来计算它在数据集中最近的邻居。细胞特有的模态权重和多模态邻居在一个函数中计算,该函数在此数据集上运行约2分钟。我们指定每个模态的维数(类似于指定要包含在scRNA-seq集群中的pc的数量),但是您可以改变这些设置,以看到小的更改对总体结果的影响最小。这是Seurat 4.0 最大的一个更新,一个函数整合多模态数据。所谓的模态在我们数据分析中就是多了一个表格与同模态不同的是,这里的数据来源不同。在使用之前我们先看一下这个函数的文档?FindMultiModalNeighbors
。
?FindMultiModalNeighbors
# Identify multimodal neighbors. These will be stored in the neighbors slot,
# and can be accessed using bm[['weighted.nn']]
# The WNN graph can be accessed at bm[["wknn"]],
# and the SNN graph used for clustering at bm[["wsnn"]]
# Cell-specific modality weights can be accessed at bm$RNA.weight
bm <- FindMultiModalNeighbors(
bm, reduction.list = list("pca", "apca"),
dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight"
)
可以看出整合多模态是在各自的降维空间里进行的,可以指定降维方法和维度。
Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=18s
Calculating kernel bandwidths
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s
Finding multimodal neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 07s
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s
Constructing multimodal KNN graph
Constructing multimodal SNN graph
我们来看看这个加权矩阵有哪些内容。
str(bm@neighbors$weighted.nn)
Formal class 'Neighbor' [package "Seurat"] with 5 slots
..@ nn.idx : num [1:30672, 1:20] 21408 26747 2354 14276 2344 ...
..@ nn.dist : num [1:30672, 1:20] 0.146 0.202 0.297 0.336 0.301 ...
..@ alg.idx : NULL
..@ alg.info : list()
..@ cell.names: chr [1:30672] "a_AAACCTGAGCTTATCG-1" "a_AAACCTGAGGTGGGTT-1" "a_AAACCTGAGTACATGA-1" "a_AAACCTGCAAACCTAC-1" ...
head(bm@neighbors$weighted.nn@nn.idx)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 21408 24052 3638 22480 24855 21388 12916 21993 15870 2248 20577 15011 27793 16535 12400 29694 30082 8634 15448 20145
[2,] 26747 7037 3857 10338 8055 25281 14106 28950 10337 9524 17717 12946 29411 7192 638 14297 27010 17611 7282 8051
[3,] 2354 9333 3422 8414 22000 30077 993 29367 19518 8283 20360 4269 17634 8542 9665 8855 12444 3587 14110 1120
[4,] 14276 28818 12742 11694 27019 16333 809 24540 27160 22614 28896 3774 1841 18030 5055 25481 23135 11424 22303 21478
[5,] 2344 7799 6268 12324 13830 4142 7393 10436 12439 27303 19328 25339 24713 5597 14001 20544 19067 12394 4108 28421
[6,] 5797 21480 12600 24915 21105 2197 18431 23934 22306 19231 11036 1687 24800 23613 24455 19422 14841 17582 29698 28724
WNN图结构被存在bm[["wknn"]]
,用于聚类的SNN图在bm[["wsnn"]]
,既然是图,我们不免想起要用igraph探索一番。以bm[["wknn"]]
为例。
library(igraph)
net <- graph_from_adjacency_matrix(bm[["wknn"]])
net
net
IGRAPH 2231c24 DN-- 30672 1014296 --
+ attr: name (v/c)
+ edges from 2231c24 (vertex names):
[1] a_AAACCTGAGCTTATCG-1->a_AAACCTGAGCTTATCG-1 a_AGGCCACTCTTCATGT-1->a_AAACCTGAGCTTATCG-1 a_CACAGTAAGTATTGGA-1->a_AAACCTGAGCTTATCG-1
[4] a_CTGTTTACACCTCGGA-1->a_AAACCTGAGCTTATCG-1 a_GATGCTAGTCAATGTC-1->a_AAACCTGAGCTTATCG-1 a_TCGGGACGTGACTACT-1->a_AAACCTGAGCTTATCG-1
+ ... omitted several edges
这是一个有向图,节点数是30672 (细胞数),边数量为1014296 。
is_weighted(net)
[1] FALSE
is_simple(net)
[1] FALSE
edge_density(net)
[1] 0.001078188
transitivity(net)
[1] 0.1941536
reciprocity(net, mode="default")
[1] 1
is_connected(net)
[1] TRUE
comps <- decompose(net)
table(sapply(comps, vcount))
30672
1
边的分布:
d.net <- degree(net)
hist(d.net,col="blue",
xlab="Degree", ylab="Frequency",
main="Degree Distribution")
取子图,主要是这么多节点不好画图。
par(mfrow=c(1, 2))
plot(subnet)
visualize_graph( subnet, centrality.type="Markov Centrality")
可以看出有的细胞有许多链接,但是大部分细胞和其他的没有什么链接。
然后,我们基于这个网络,来用igraph的cluster_fast_greedy
聚类。
# igraph::as.undirected()
kc <- cluster_fast_greedy( as.undirected(net)) #
sizes(kc)
Community sizes
1 2 3 4 5 6 7 8
76 377 370 9532 23 3997 8088 8209
聚成了8个类,看每个细胞的归宿。
head(membership(kc))# 社团归宿 )
a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1 a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1 a_AAACCTGCACGGTAGA-1
4 8 7 7 4 6
对聚类结果做个可视化
library(ape)
dendPlot(kc, mode="phylo")
好了,结束我们对数据结构的探索。这不是教程的一部分,而是我们学了网络图之后,探究起细胞组成的网络是可以有许多自己的视角。我们现在可以使用这些结果进行下游分析,比如可视化和聚类。例如,我们可以基于RNA和蛋白质数据的加权组合创建数据的UMAP可视化。我们还可以执行基于图形的聚类,并在UMAP上可视化这些结果,以及一组细胞注释。
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution =seq( 0,2,0.4), verbose = FALSE)
head(bm@meta.data)
orig.ident nCount_RNA nFeature_RNA nCount_ADT nFeature_ADT lane donor celltype.l1 celltype.l2 RNA.weight wsnn_res.2
a_AAACCTGAGCTTATCG-1 bmcite 7546 2136 1350 25 HumanHTO4 batch1 Progenitor cells Prog_RBC 0.4827011 19
a_AAACCTGAGGTGGGTT-1 bmcite 1029 437 2970 25 HumanHTO1 batch1 T cell gdT 0.2417890 10
a_AAACCTGAGTACATGA-1 bmcite 1111 429 2474 23 HumanHTO5 batch1 T cell CD4 Naive 0.5077136 1
a_AAACCTGCAAACCTAC-1 bmcite 2741 851 4799 25 HumanHTO3 batch1 T cell CD4 Memory 0.4313079 4
a_AAACCTGCAAGGTGTG-1 bmcite 2099 843 5434 25 HumanHTO2 batch1 Mono/DC CD14 Mono 0.5685085 2
a_AAACCTGCACGGTAGA-1 bmcite 2291 783 4658 25 HumanHTO6 batch1 B cell Naive B 0.4255879 5
seurat_clusters wsnn_res.0 wsnn_res.0.4 wsnn_res.0.8 wsnn_res.1.2 wsnn_res.1.6
a_AAACCTGAGCTTATCG-1 19 0 9 8 18 19
a_AAACCTGAGGTGGGTT-1 10 0 10 9 9 9
a_AAACCTGAGTACATGA-1 1 0 1 1 0 1
a_AAACCTGCAAACCTAC-1 4 0 3 3 4 4
a_AAACCTGCAAGGTGTG-1 2 0 0 0 1 2
a_AAACCTGCACGGTAGA-1 5 0 4 4 5 5
注意,这里有一列RNA.weight
是指在RNA和膜蛋白中RNA的权重,而Protein weight is 1 - RNA weight.
library(clustree)
clustree(bm,prefix = 'wsnn_res.')
p1 <- DimPlot(bm, reduction = 'wnn.umap', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
p2 <- DimPlot(bm, reduction = 'wnn.umap', group.by = 'celltype.l2', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
p1 + p2
我们也可以计算UMAP可视化仅基于RNA和蛋白质数据和比较。我们发现,在识别祖细胞状态方面,RNA分析比ADT分析提供的信息更多(ADT panel包含分化细胞的标记),而T细胞状态则相反(ADT分析优于RNA分析)。
bm <- RunUMAP(bm, reduction = 'pca', dims = 1:30, assay = 'RNA',
reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
bm <- RunUMAP(bm, reduction = 'apca', dims = 1:18, assay = 'ADT',
reduction.name = 'adt.umap', reduction.key = 'adtUMAP_') # 这个dims选择的有技术含量啊
p3 <- DimPlot(bm, reduction = 'rna.umap', group.by = 'celltype.l2', label = TRUE,
repel = TRUE, label.size = 2.5) + NoLegend()
p4 <- DimPlot(bm, reduction = 'adt.umap', group.by = 'celltype.l2', label = TRUE,
repel = TRUE, label.size = 2.5) + NoLegend()
p3 + p4
我们可以可视化多模态UMAP上典型标记基因和蛋白的表达,这有助于验证所提供的注释:
p5 <- FeaturePlot(bm, features = c("adt_CD45RA","adt_CD16","adt_CD161"),
reduction = 'wnn.umap', max.cutoff = 2,
cols = c("lightgrey","darkgreen"), ncol = 3)
p6 <- FeaturePlot(bm, features = c("rna_TRDC","rna_MPO","rna_AVP"),
reduction = 'wnn.umap', max.cutoff = 3, ncol = 3)
p5 / p6
最后,我们可以将每个细胞的模态权重可视化。RNA重量最高的每个群体代表祖细胞,而蛋白重量最高的群体代表T细胞。这符合我们的生物学预期,因为抗体panel不包含可以区分不同祖细胞群体的标记。
VlnPlot(bm, features = "RNA.weight", group.by = 'celltype.l2', sort = TRUE, pt.size = 0) +
NoLegend()
https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html
https://github.com/satijalab/seurat/issues/3670