导言
我们之前介绍了CITE-seq的测序原理和使用CITE-seq-Count工具处理reads数据得到蛋白表达矩阵。
CITE-seq-Count输出文件结构如下:
OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml
跟处理scRNA-seq数据的 cell ranger工具的输出结果有些类似。最后需要用到的是其中umi_count文件夹下的matrix.mtx.gz,features.tsv.gz,barcodes.tsv.gz这三个文件。
因为mRNA 表达和细胞表面蛋白表达是配套的数据,也就是每个细胞都得到了其mRNA全转录组和部分细胞表面蛋白的表达水平,那么分析这两个数据就有两种思路:
- 仅基于scRNA-seq 数据对细胞进行聚类、分群、降维等标准流程,然后将细胞表面蛋白表达水平(ADT值)作为附加信息进行可视化等分析[1]。这种分析方法跟常规的单细胞转录组数据分析流程没有太大差别。
- 同时基于scRNA-seq数据和细胞表面蛋白表达数据(ADT值)对细胞进行聚类、分群、降维等分析[2]。这种方法需要使用到weighted-nearest neighbor’ (WNN)等多模型数据整合算法来对数据进行整合。
WNN算法是一个无监督的框架,用于了解每个细胞中每种数据类型的相对效用,从而能够对多种数据类型进行综合分析。
Seurat官方示例教程中有提到,对30672个同时测了mRNA和25个细胞表面蛋白的来自骨髓的单细胞数据分析结果显示,细胞表面蛋白可以帮助一些免疫细胞状态的细分:
We find that the RNA analysis is more informative than the ADT analysis in identifying progenitor states (the ADT panel contains markers for differentiated cells), while the converse is true of T cell states (where the ADT analysis outperforms RNA).
对该骨髓来源来的免疫细胞数据,mRNA表达数据能更好的区分祖细胞的状态,细胞表面蛋白表达数据则能更好的区分T细胞的状态。当然这有可能是因为该数据检测的细胞表面蛋白抗体panel里,包括了针对分化细胞的marker而没有包括针对祖细胞状态的marker,毕竟它只包含了25个蛋白。
鉴于Seurat的示例数据里细胞表面蛋白的检测panel都比较小,为了验证细胞表面蛋白表达数据在帮助各种细胞类型细分时的效果,我们用之前提到的那篇文献中的BRCA数据进行分析,该研究测了118个细胞表面蛋白panel,有来自4个样本共7363个单细胞的scCITE-seq数据。
Seurat包分析CITE-seq数据
library(Seurat)
library(ggplot2)
library(patchwork)
###
adt=Read10X('./umi_count/',gene.column = 1)
med<-read.csv('./metadata.csv',row.names = 1)
counts <- Read10X(data.dir = path,gene.column=1)
brca.cite.med=med[med$cell_id%in%colnames(adt),]
table(brca.cite.med$celltype_major)
# B-cells CAFs Cancer Epithelial Endothelial Myeloid
# 631 686 7 549 1000
# Plasmablasts PVL T-cells
# 151 972 3374
#可以看到肿瘤上皮细胞非常少。为了方面后续分析,这里去掉这些肿瘤上皮细胞
brca.cite.med=brca.cite.med[brca.cite.med$celltype_major!='Cancer Epithelial',]
brca.cite.rna=counts[,rownames(brca.cite.med)]
brca.cite.adt=adt[,rownames(brca.cite.med)]
##
cols_major= RColorBrewer::brewer.pal(n=7,name='Paired')
names(cols_major)=unique(brca.cite.med$celltype_major)
###########################################################
### Define cellular states based on only mRNA expression ##
############################################################
###Setup a Seurat object, add the RNA and protein data
brca.cite <- CreateSeuratObject(counts = brca.cite.rna,
meta.data =brca.cite.med )
adt_assay <- CreateAssayObject(counts = brca.cite.adt)
# add this assay to the previously created Seurat object
brca.cite[["ADT"]] <- adt_assay
###Cluster cells on the basis of their scRNA-seq profiles
DefaultAssay(brca.cite) <- "RNA"
# perform visualization and clustering steps
brca.cite <- NormalizeData(brca.cite)
brca.cite <- FindVariableFeatures(brca.cite)
brca.cite <- ScaleData(brca.cite)
brca.cite <- RunPCA(brca.cite, verbose = FALSE)
# 因为我们已经有细胞的分群信息,所以不再进行聚类和分群
# brca.cite <- FindNeighbors(brca.cite, dims = 1:30)
# brca.cite <- FindClusters(brca.cite, resolution = 0.8, verbose = FALSE)
brca.cite <- RunUMAP(brca.cite,reduction = "pca", dims = 1:30)
p1=DimPlot(brca.cite,group.by = 'orig.ident')+NoLegend()
p2=DimPlot(brca.cite,group.by = 'celltype_major', cols = cols_major,label = T,label.size = 5,repel = T)
p1|p2
###Normalize ADT data,
DefaultAssay(brca.cite) <- "ADT"
brca.cite <- NormalizeData(brca.cite, normalization.method = "CLR", margin = 2)
DefaultAssay(brca.cite) <- "RNA"
###Identify cell surface markers for scRNA-seq clusters
Idents(brca.cite)=brca.cite$celltype_major
adt_markers <- FindAllMarkers(brca.cite, assay = "ADT")
rna_markers <- FindAllMarkers(brca.cite, assay = "RNA")
###查看一些基质细胞和免疫细胞标志基因的蛋白和mRNA的表达:
p1 <- FeaturePlot(brca.cite, c("adt_CD45",'adt_CD31','adt_CD19',"adt_MCAM"),slot='data', cols = viridis(15),ncol = 4)
p2 <- FeaturePlot(brca.cite, c("rna_PTPRC",'rna_PECAM1','rna_CD19',"rna_MCAM"),slot='data', cols =inferno(15),ncol = 4)
p1 /p2
################################################################
##### WNN: Define cellular states based on both data types #####
################################################################
brca.cite_wnn <- CreateSeuratObject(counts = brca.cite.rna,meta.data =brca.cite.med )
adt_assay <- CreateAssayObject(counts = brca.cite.adt)
# add this assay to the previously created Seurat object
brca.cite_wnn[["ADT"]] <- adt_assay
#### Pre-processing and dimensional reduction on both assays independently.
DefaultAssay(brca.cite_wnn) <- 'RNA'
brca.cite_wnn <- NormalizeData(brca.cite_wnn) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(brca.cite_wnn) <- 'ADT'
VariableFeatures(brca.cite_wnn) <- rownames(brca.cite_wnn[["ADT"]])
brca.cite_wnn <- NormalizeData(brca.cite_wnn, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(reduction.name = 'apca')
### Identify multimodal neighbors
brca.cite_wnn <- FindMultiModalNeighbors(
brca.cite_wnn, reduction.list = list("pca", "apca"),
dims.list = list(1:30, 1:20), modality.weight.name = "RNA.weight"
)
### Clustering cells
brca.cite_wnn <- RunUMAP(brca.cite_wnn, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
#brca.cite_wnn <- FindClusters(brca.cite_wnn, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)
p1 <- DimPlot(brca.cite_wnn, reduction = 'wnn.umap', group.by = 'orig.ident') + NoLegend()
p2 <- DimPlot(brca.cite_wnn, reduction = 'wnn.umap', group.by = 'celltype_major', cols =cols_major,label = T,label.size = 5,repel = T) #+ NoLegend()
p1 + p2
p5 <- FeaturePlot(brca.cite_wnn, c("adt_CD45",'adt_CD31','adt_CD19',"adt_MCAM"),slot='data', cols = viridis(15),ncol = 4)
p6 <- FeaturePlot(brca.cite_wnn, c("rna_PTPRC",'rna_PECAM1','rna_CD19',"rna_MCAM"),slot='data', cols =inferno(15),ncol = 4)
p5 /p6
###Compareing the UMAP visualization based on only the RNA, protein data and both of them
brca.cite_wnn <- RunUMAP(brca.cite_wnn, reduction = 'pca', dims = 1:30, assay = 'RNA',
reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
brca.cite_wnn <- RunUMAP(brca.cite_wnn, reduction = 'apca', dims = 1:20, assay = 'ADT',
reduction.name = 'adt.umap', reduction.key = 'adtUMAP_')
p3 <- DimPlot(brca.cite_wnn, reduction = 'rna.umap', group.by = 'celltype_major', label = F, cols =cols_major)+ ggtitle("RNA") +theme(plot.title = element_text(hjust = 0.5)) + NoLegend()
p4 <- DimPlot(brca.cite_wnn, reduction = 'adt.umap', group.by = 'celltype_major', label = F, cols =cols_major)+ ggtitle("ADT") +theme(plot.title = element_text(hjust = 0.5)) + NoLegend()
p5<- DimPlot(brca.cite_wnn, reduction = 'wnn.umap', group.by = 'celltype_major', label = F, cols =cols_major)+ ggtitle("WNN") +theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") #+ NoLegend()
p5+p3+p4
再来比较一下在细分时,两种方法的效果:
Reference:
[1].https://satijalab.org/seurat/articles/multimodal_vignette.html
[2].https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis.html
[3].https://www.jianshu.com/p/ceee6aa17335
欢迎关注同名公众号SciNote