2020年3月26日,浙江大学郭国骥团队在nature发表了人类单细胞水平的细胞景观。鉴定了60种人体组织样品涵盖八大系统的人类细胞图谱分析涉及人体100余种细胞大类和800余种细胞亚类。建立了70多万个单细胞的转录组数据库,基于该数据库,团队开发了scHCL单细胞比对系统用于人体细胞类型的识别,并搭建了人类细胞蓝图网站HCL(国家基因库镜像HCL)。
这么大规模的单细胞测序,细胞类型的鉴定肯定是一道亮丽的技术风景。同时,人类的细胞图谱的绘制对以后人类细胞类型的鉴定也定有他独特的借鉴意义。那么,文章中是如何鉴定的呢?
文章中有更为详细的描述,实现细节在R包https://github.com/ggjlab/scHCL中以及在线http://bis.zju.edu.cn/HCL/blast.html中。在线版的界面如下:
界面清晰明了。我们用scHCL来尝试做一下细胞类型的鉴定。
library(scHCL)
library(Seurat)
library(tidyverse)
# hcl_lung is an example expression matrix from HCL project.
> data(hcl_lung)
> dim(hcl_lung)
[1] 2884 80
# 2884 genes expression value of 80 cells
# scHCL has two parameters , single cell expression matrix(scdata) and
# the number of most similar cell types
> hcl_result <- scHCL(scdata = hcl_lung, numbers_plot = 3)
head(hcl_result$scHCL_probility)
Cell Cell type Score
1 E18_2_C06 Stratified epithelial cell (Adult-Trachea2) 0.07423052
2 E18_2_C06 Epithelial cell_MSMB high (Adult-Trachea2) 0.06744750
3 E18_2_C06 Basal cell_KRT6A high (Adult-Trachea2) 0.07889934
4 E18_2_C07 Zona fasciculata cell (Fetal-Adrenal-Gland2) 0.05358975
5 E18_2_C07 Intra-adrenal chromoblast (Fetal-Adrenal-Gland4) 0.05475451
6 E18_2_C07 Unknown (Fetal-Muscle1) 0.05533774
每个细胞亚群有三种可能,当然我们需要深入的探索一下,于是推出了shiny程序:
# open shiny for visualize result for scHCL
scHCL_vis(hcl_result)
那如何使用我们自己的数据呢?
我们先看一看,R包自带的参考数据集的信息。
rm(list=ls())
library(scHCL)
load("C:\\Users\\Administrator\\Desktop\\scHCL-master\\data\\ref.expr.rda")
ref.expr[1:4,1:4]
M2 Macrophage(Adult-Adipose1) Stromal cell(Adult-Adipose1) Adipocyte_SPP1 high(Adult-Adipose1) Mast cell(Adult-Adipose1)
CD163 409.0000 162.00000 285 315
F13A1 237.0000 75.00000 52 190
STAB1 172.3333 56.33333 66 164
VSIG4 188.3333 52.33333 93 113
dim(ref.expr)
[1] 6075 1841
colnames(ref.expr)[1:4]
[1] "M2 Macrophage(Adult-Adipose1)" "Stromal cell(Adult-Adipose1)" "Adipocyte_SPP1 high(Adult-Adipose1)" "Mast cell(Adult-Adipose1)"
然后我们读入自己的数据:
met <- 'F:\\scOrange\\example\\GSM3587965_AML556-D15.dem.txt\\AML556-D15.dem.txt'
met <-readr::read_tsv(met)
Parsed with column specification:
cols(
.default = col_double(),
Gene = col_character()
)
See spec(...) for full column specifications.
|=============================================================================================================================================| 100% 64 MB
met[1:4,1:4]
# A tibble: 4 x 4
Gene `AML556-D15_AAAAAGGTGCTN` `AML556-D15_AAAAAGGTTTCT` `AML556-D15_AAAACCTCGGTN`
<chr> <dbl> <dbl> <dbl>
1 A1BG 0 0 0
2 A1BG-AS1 0 0 0
3 A1CF 0 0 0
4 A2M 0 0 0
met <- as.data.frame(met)
rownames(met) <- met$Gene
met <- met[,-1]
met[1:4,1:4]
AML556-D15_AAAAAGGTGCTN AML556-D15_AAAAAGGTTTCT AML556-D15_AAAACCTCGGTN AML556-D15_AAACACTAAGGN
A1BG 0 0 0 0
A1BG-AS1 0 0 0 0
A1CF 0 0 0 0
A2M 0 0 0 0
>
运行scHCL预测。
也可以先运行聚类求平均表达谱,用平均表达谱来做。这里由于我不知道分群多少合适,就直接用表达谱做了(这样比较慢)。
que.seu <-CreateSeuratObject(met)
que.seu_suerat <- scHCL(scdata = que.seu@assays$RNA, numbers_plot = 3)
我们来看看预测的结果:
que.seu_suerat$cors_matrix[1:4,1:4]
AML556-D15_AAAAAGGTGCTN AML556-D15_AAAAAGGTTTCT AML556-D15_AAAACCTCGGTN AML556-D15_AAACACTAAGGN
M2 Macrophage (Adult-Adipose1) 0.08364598 0.04890905 0.04529126 0.10508757
Stromal cell (Adult-Adipose1) 0.05262400 0.03501448 0.02367288 0.06850328
Adipocyte_SPP1 high (Adult-Adipose1) 0.06046088 0.04372272 0.04175932 0.09665644
Mast cell (Adult-Adipose1) 0.07187894 0.06195473 0.04336573 0.10477821
> que.seu_suerat$top_cors
[1] 3
> head(que.seu_suerat$scHCL_probility,5)
Cell Cell type Score
1 AML556-D15_AAAAAGGTGCTN T cells2 (Placenta_VentoTormo) 0.2451411
2 AML556-D15_AAAAAGGTGCTN Blood NK CD16+ (Placenta_VentoTormo) 0.2206799
3 AML556-D15_AAAAAGGTGCTN T cells3 (Placenta_VentoTormo) 0.2256222
4 AML556-D15_AAAAAGGTTTCT T cells2 (Placenta_VentoTormo) 0.2001931
5 AML556-D15_AAAAAGGTTTCT T cells3 (Placenta_VentoTormo) 0.1936185
每个细胞numbers_plot =3 ,所以每个细胞展示3种celltype。
scHCL_vis(que.seu_suerat)
在单细胞同量那么高的情况下,其实人们没时间去鉴定每一个细胞的类型,而这里还给每个细胞好几种类型。其实可先分群做成一个平均表达谱,再mapping到bulk的表达谱上,以鉴定亚群的细胞类型。这里作为探索,我们取Score值最值,在某个Score范围内注释到的celltype最多的。我们先退一步根据每个细胞的类型推测分群的celltype。
que.seu_suerat <- scHCL(scdata = que.seu@assays$RNA, numbers_plot = 1)
> que.seu_suerat$cors_matrix[1:4,1:4]
AML556-D15_AAAAAGGTGCTN AML556-D15_AAAAAGGTTTCT AML556-D15_AAAACCTCGGTN AML556-D15_AAACACTAAGGN
M2 Macrophage (Adult-Adipose1) 0.08364598 0.04890905 0.04529126 0.10508757
Stromal cell (Adult-Adipose1) 0.05262400 0.03501448 0.02367288 0.06850328
Adipocyte_SPP1 high (Adult-Adipose1) 0.06046088 0.04372272 0.04175932 0.09665644
Mast cell (Adult-Adipose1) 0.07187894 0.06195473 0.04336573 0.10477821
> que.seu_suerat$top_cors
[1] 1
> head(que.seu_suerat$scHCL_probility,5)
Cell Cell type Score
1 AML556-D15_AAAAAGGTGCTN T cells2 (Placenta_VentoTormo) 0.2451411
2 AML556-D15_AAAAAGGTTTCT T cells2 (Placenta_VentoTormo) 0.2001931
3 AML556-D15_AAAACCTCGGTN T cells2 (Placenta_VentoTormo) 0.2413753
4 AML556-D15_AAACACTAAGGN T cells3 (Placenta_VentoTormo) 0.2812856
5 AML556-D15_AAACAGCTACAA T cells3 (Placenta_VentoTormo) 0.2290333
>
执行seurat标准流程:
rownames(que.seu_suerat$scHCL_probility) <- que.seu_suerat$scHCL_probility$Cell
que.seu %>% FindVariableFeatures() %>%
ScaleData() %>%
RunPCA() %>%
RunUMAP(dim=1:30) %>%
FindNeighbors() %>%
FindClusters(resolution =c(0.2,0.4,0.6,0.8))->que.seu # 制定多种分群结果。
que.seu@meta.data$Cell <- rownames(que.seu@meta.data)
merge(que.seu@meta.data,que.seu_suerat$scHCL_probility,by= "Cell") -> que.seu@meta.data
rownames(que.seu@meta.data) <- que.seu@meta.data$Cell
head(que.seu@meta.data)
Cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.2
AML556-D15_AAAAAGGTGCTN AML556-D15_AAAAAGGTGCTN AML556-D15 1251 609 0
AML556-D15_AAAAAGGTTTCT AML556-D15_AAAAAGGTTTCT AML556-D15 1253 529 0
AML556-D15_AAAACCTCGGTN AML556-D15_AAAACCTCGGTN AML556-D15 1610 671 0
AML556-D15_AAACACTAAGGN AML556-D15_AAACACTAAGGN AML556-D15 1229 597 0
AML556-D15_AAACAGCTACAA AML556-D15_AAACAGCTACAA AML556-D15 1062 547 0
AML556-D15_AAACCTCCGGGA AML556-D15_AAACCTCCGGGA AML556-D15 1325 708 0
RNA_snn_res.0.4 RNA_snn_res.0.6 RNA_snn_res.0.8 seurat_clusters
AML556-D15_AAAAAGGTGCTN 0 1 1 1
AML556-D15_AAAAAGGTTTCT 0 0 0 0
AML556-D15_AAAACCTCGGTN 0 0 0 0
AML556-D15_AAACACTAAGGN 0 0 0 0
AML556-D15_AAACAGCTACAA 0 1 1 1
AML556-D15_AAACCTCCGGGA 1 2 2 2
Cell type Score
AML556-D15_AAAAAGGTGCTN T cells2 (Placenta_VentoTormo) 0.2451411
AML556-D15_AAAAAGGTTTCT T cells2 (Placenta_VentoTormo) 0.2001931
AML556-D15_AAAACCTCGGTN T cells2 (Placenta_VentoTormo) 0.2413753
AML556-D15_AAACACTAAGGN T cells3 (Placenta_VentoTormo) 0.2812856
AML556-D15_AAACAGCTACAA T cells3 (Placenta_VentoTormo) 0.2290333
AML556-D15_AAACCTCCGGGA T cells2 (Placenta_VentoTormo) 0.2517737
差异分析:
DT::datatable(FindAllMarkers(que.seu))
Calculating cluster 0
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=23s
Calculating cluster 1
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=26s
Calculating cluster 2
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=19s
Calculating cluster 3
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=28s
Calculating cluster 4
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 07s
Calculating cluster 5
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=24s
这可以和HCL官网的数据做个比较:
做个我们的桑基图看一看我们的注释结果吧。
library(sankeywheel)
sankeywheel(
from = que.seu@meta.data$seurat_clusters, to = que.seu@meta.data$`Cell type`,
weight = que.seu@meta.data$Score,#type = "sankey",
title = "sk",
subtitle = "scHCL Project",
seriesName = "", width = "100%", height = "600px"
)
还是掰直了吧:
我们看到0,1,2到细胞类型那都比价均匀第混合在一起了,再看看他们最可能的注释结果:
que.seu@meta.data %>%
group_by(seurat_clusters,`Cell type`) %>%
count(`Cell type`) %>%
arrange(seurat_clusters,desc(n)) %>%
group_by(seurat_clusters)%>%
top_n(1)#summarise()
Selecting by n
# A tibble: 6 x 3
# Groups: seurat_clusters [6]
seurat_clusters `Cell type` n
<fct> <fct> <int>
1 0 T cells2 (Placenta_VentoTormo) 252
2 1 T cells2 (Placenta_VentoTormo) 239
3 2 T cells2 (Placenta_VentoTormo) 131
4 3 B cell (Plasmocyte) (Adult-Thyroid2) 38
5 4 Immature myeloid progenitor cell (Haematopoietic-Stem-Cell_Velten) 59
6 5 dNKp (Placenta_VentoTormo) 25
这是seurat_clusters = RNA_snn_res.0.8 时的分群以及注释结果,果然0,1,2都注释到了同一种细胞类型,这是真的吗?所以我们希望知道这三个群的关系是怎样的呢?
library(clustree)
que.seu@meta.data %>%
select(starts_with("RNA_snn_res")) %>%
mutate(RNA_snn_res.0.0=0)%>%
clustree( prefix = "RNA_snn_res.0")
他们真的就是来自 RNA_snn_res.0.2时的同一个群啊,这是不是启发我们:其实分为3-4个群也是可以呢?
当 RNA_snn_res.0.2时:
que.seu@meta.data %>%
group_by(RNA_snn_res.0.2,`Cell type`) %>%
count(`Cell type`) %>%
arrange(RNA_snn_res.0.2,desc(n)) %>%
group_by(RNA_snn_res.0.2)%>%
top_n(1)#summarise()
Selecting by n
# A tibble: 3 x 3
# Groups: RNA_snn_res.0.2 [3]
RNA_snn_res.0.2 `Cell type` n
<fct> <fct> <int>
1 0 T cells2 (Placenta_VentoTormo) 621
2 1 Immature myeloid progenitor cell (Haematopoietic-Stem-Cell_Velten) 61
3 2 B cell (Plasmocyte) (Adult-Thyroid2) 38
que.seu@meta.data %>% mutate(CellType = ifelse(RNA_snn_res.0.2 == 0,"T cells2 (Placenta_VentoTormo)",
ifelse(RNA_snn_res.0.2==1,"Immature myeloid progenitor cell (Haematopoietic-Stem-Cell_Velten)",
"B cell (Plasmocyte) (Adult-Thyroid2)")))->que.seu@meta.data
rownames(que.seu@meta.data ) <-que.seu@meta.data$Cell
head(que.seu@meta.data)
DimPlot(que.seu,group.by ="CellType")
这里是探索细胞类型,而不是百分百的鉴定它,但是这至少给了我们一个可能的选择。我相信细胞类型的注释会随着单细胞图谱的发表而得到改善,基于表达谱mapping的方法是个应用的方向。其实之前我们用seurst就讲过它可以把未知类型的细胞mapping到reference上,只是一直以来缺少可以mapping的reference。
ref.seu <-CreateSeuratObject(ref.expr)
ref.seu %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA() -> ref.seu
head(ref.seu@meta.data)
?FindTransferAnchors
anchors <- FindTransferAnchors(reference = ref.seu, query = que.seu,
dims = 1:30,k.anchor = 3 ,k.filter = 50,
features = intersect(VariableFeatures(que.seu),VariableFeatures(ref.seu))[1:500])# features = VariableFeatures(rque.seu)[1:500]
?TransferData
ref.seu %>% FindNeighbors() -> ref.seu
ref.seu <-FindClusters(
ref.seu,
resolution = c(0.2)
)
head(ref.seu@meta.data)
table(Idents(ref.seu))
predictions <- TransferData(anchorset = anchors, refdata = ref.seu@meta.data$seurat_clusters,
dims = 1:10,
k.weight = 5,)
head(predictions)
query <- AddMetaData(que.seu, metadata = predictions)
head(query@meta.data)
query$prediction.match <- query$predicted.id == query$celltype
query %>% RunUMAP(dims = 1:30)%>% RunTSNE(dims = 1:30) -> query
head(query@meta.data)
Cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.2
AML556-D15_AAAAAGGTGCTN AML556-D15_AAAAAGGTGCTN AML556-D15 1251 609 0
AML556-D15_AAAAAGGTTTCT AML556-D15_AAAAAGGTTTCT AML556-D15 1253 529 0
AML556-D15_AAAACCTCGGTN AML556-D15_AAAACCTCGGTN AML556-D15 1610 671 0
AML556-D15_AAACACTAAGGN AML556-D15_AAACACTAAGGN AML556-D15 1229 597 0
AML556-D15_AAACAGCTACAA AML556-D15_AAACAGCTACAA AML556-D15 1062 547 0
AML556-D15_AAACCTCCGGGA AML556-D15_AAACCTCCGGGA AML556-D15 1325 708 0
RNA_snn_res.0.4 RNA_snn_res.0.6 RNA_snn_res.0.8 seurat_clusters
AML556-D15_AAAAAGGTGCTN 0 1 1 1
AML556-D15_AAAAAGGTTTCT 0 0 0 0
AML556-D15_AAAACCTCGGTN 0 0 0 0
AML556-D15_AAACACTAAGGN 0 0 0 0
AML556-D15_AAACAGCTACAA 0 1 1 1
AML556-D15_AAACCTCCGGGA 1 2 2 2
Cell type Score CellTpye
AML556-D15_AAAAAGGTGCTN T cells2 (Placenta_VentoTormo) 0.2451411 T cells2 (Placenta_VentoTormo)
AML556-D15_AAAAAGGTTTCT T cells2 (Placenta_VentoTormo) 0.2001931 T cells2 (Placenta_VentoTormo)
AML556-D15_AAAACCTCGGTN T cells2 (Placenta_VentoTormo) 0.2413753 T cells2 (Placenta_VentoTormo)
AML556-D15_AAACACTAAGGN T cells3 (Placenta_VentoTormo) 0.2812856 T cells2 (Placenta_VentoTormo)
AML556-D15_AAACAGCTACAA T cells3 (Placenta_VentoTormo) 0.2290333 T cells2 (Placenta_VentoTormo)
AML556-D15_AAACCTCCGGGA T cells2 (Placenta_VentoTormo) 0.2517737 T cells2 (Placenta_VentoTormo)
CellType predicted.id prediction.score.1
AML556-D15_AAAAAGGTGCTN T cells2 (Placenta_VentoTormo) 2 0.0000000
AML556-D15_AAAAAGGTTTCT T cells2 (Placenta_VentoTormo) 2 0.1732816
AML556-D15_AAAACCTCGGTN T cells2 (Placenta_VentoTormo) 2 0.0000000
AML556-D15_AAACACTAAGGN T cells2 (Placenta_VentoTormo) 2 0.1732816
AML556-D15_AAACAGCTACAA T cells2 (Placenta_VentoTormo) 2 0.0000000
AML556-D15_AAACCTCCGGGA T cells2 (Placenta_VentoTormo) 2 0.0000000
prediction.score.0 prediction.score.3 prediction.score.5 prediction.score.2
AML556-D15_AAAAAGGTGCTN 0 0 0 1.0000000
AML556-D15_AAAAAGGTTTCT 0 0 0 0.8267184
AML556-D15_AAAACCTCGGTN 0 0 0 1.0000000
AML556-D15_AAACACTAAGGN 0 0 0 0.8267184
AML556-D15_AAACAGCTACAA 0 0 0 1.0000000
AML556-D15_AAACCTCCGGGA 0 0 0 1.0000000
prediction.score.6 prediction.score.4 prediction.score.max
AML556-D15_AAAAAGGTGCTN 0 0 1.0000000
AML556-D15_AAAAAGGTTTCT 0 0 0.8267184
AML556-D15_AAAACCTCGGTN 0 0 1.0000000
AML556-D15_AAACACTAAGGN 0 0 0.8267184
AML556-D15_AAACAGCTACAA 0 0 1.0000000
AML556-D15_AAACCTCCGGGA 0 0 1.0000000
#
#DimPlot(query,group.by ="predicted.id")