简介
目前单细胞数据分析有不少的挑战,比如稀疏矩阵,超高维度数据降维,批次效应校正,聚类算法的选择,多组学数据整合等挑战。有需求就需要造轮子,当前也有了部分工具可以处理部分挑战,但准确性还需要具体评估,本篇文章介绍实操整合多个批次数据工具SeuratV3中嵌合的整合算法处理官网实示例PBMC刺激及对照实验组数据,SeuratV3简介及实操一文中有对该工具的具体介绍及常规实操。
整合目标
概述使用Seurat集成过程可能实现的复杂细胞类型的比较分析。
- 识别两个数据集中存在的单元格类型
- 获得在对照和受刺激细胞中都保守的细胞类型标记
- 比较数据集以找到刺激的细胞类型特异性反应
生成Seurat对象
基因表达矩阵可以在官网提供路径找到。首先读入两个计数矩阵并设置Seurat对象。
library(Seurat)
library(cowplot)
ctrl.data <- read.table(file = "../data/immune_control_expression_matrix.txt.gz", sep = "\t")
stim.data <- read.table(file = "../data/immune_stimulated_expression_matrix.txt.gz", sep = "\t")
# Set up control object
ctrl <- CreateSeuratObject(counts = ctrl.data, project = "IMMUNE_CTRL", min.cells = 5)
ctrl$stim <- "CTRL"
ctrl <- subset(ctrl, subset = nFeature_RNA > 500)
ctrl <- NormalizeData(ctrl, verbose = FALSE)
ctrl <- FindVariableFeatures(ctrl, selection.method = "vst", nfeatures = 2000)
# Set up stimulated object
stim <- CreateSeuratObject(counts = stim.data, project = "IMMUNE_STIM", min.cells = 5)
stim$stim <- "STIM"
stim <- subset(stim, subset = nFeature_RNA > 500)
stim <- NormalizeData(stim, verbose = FALSE)
stim <- FindVariableFeatures(stim, selection.method = "vst", nfeatures = 2000)
运行集成工具
然后使用FindIntegrationAnchors函数识别锚点,该函数将Seurat对象列表作为输入,并使用这些锚点将两个数据集集成在一起IntegrateData。
immune.anchors <- FindIntegrationAnchors(object.list = list(ctrl, stim), dims = 1:20)
immune.combined <- IntegrateData(anchorset = immune.anchors, dims = 1:20)
综合分析
现在可以对所有细胞进行单一的综合分析!
DefaultAssay(immune.combined) <- "integrated"
# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
# t-SNE and Clustering
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:20)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:20)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
# Visualization
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE)
plot_grid(p1, p2)
为了并排显示这两个条件,可以使用split.by参数来显示由簇着色的每个条件。
DimPlot(immune.combined, reduction = "umap", split.by = "stim")
识别保守的细胞类型标记
为了鉴定在条件下保守的经典细胞类型标记基因,提供了该FindConservedMarkers功能。该功能对每个数据集/组执行差异基因表达测试,并使用MetaDE R包中的元分析方法组合p值。例如,可以计算保守标记的基因,而不考虑簇6(NK细胞)中的刺激条件。
DefaultAssay(immune.combined) <- "RNA"
nk.markers <- FindConservedMarkers(immune.combined, ident.1 = 7, grouping.var = "stim", verbose = FALSE)
head(nk.markers)
可以探索每个簇的这些标记基因,并使用它们来注释我们的簇作为特定的细胞类型。
FeaturePlot(immune.combined, features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A",
"CCL2", "PPBP"), min.cutoff = "q9")
immune.combined <- RenameIdents(immune.combined, `0` = "CD14 Mono", `1` = "CD4 Naive T", `2` = "CD4 Memory T",
`3` = "CD16 Mono", `4` = "B", `5` = "CD8 T", `6` = "T activated", `7` = "NK", `8` = "DC", `9` = "B Activated",
`10` = "Mk", `11` = "pDC", `12` = "Eryth", `13` = "Mono/Mk Doublets")
DimPlot(immune.combined, label = TRUE)
DotPlot具有该split.by参数的功能可用于跨条件查看保守细胞类型标记,显示表达水平和表达任何给定基因的簇中细胞的百分比。在这里为13个簇中的每一个绘制2-3个强标记基因。
Idents(immune.combined) <- factor(Idents(immune.combined), levels = c("Mono/Mk Doublets", "pDC",
"Eryth", "Mk", "DC", "CD14 Mono", "CD16 Mono", "B Activated", "B", "CD8 T", "NK", "T activated",
"CD4 Naive T", "CD4 Memory T"))
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7", "CCL5",
"CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1",
"GPR183", "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ")
DotPlot(immune.combined, features = rev(markers.to.plot), cols = c("blue", "red"), dot.scale = 8,
split.by = "stim") + RotatedAxis()
确定跨条件的差异表达基因
现在已经对受刺激细胞和对照细胞进行了比对,可以开始进行比较分析并观察刺激引起的差异。广泛观察这些变化的一种方法是绘制受刺激细胞和对照细胞的平均表达,并在散点图上寻找视觉异常值的基因。在这里,采用受刺激和对照幼稚T细胞和CD14单核细胞群的平均表达并产生散点图,突出表现出对干扰素刺激的显着反应的基因。
t.cells <- subset(immune.combined, idents = "CD4 Naive T")
Idents(t.cells) <- "stim"
avg.t.cells <- log1p(AverageExpression(t.cells, verbose = FALSE)$RNA)
avg.t.cells$gene <- rownames(avg.t.cells)
cd14.mono <- subset(immune.combined, idents = "CD14 Mono")
Idents(cd14.mono) <- "stim"
avg.cd14.mono <- log1p(AverageExpression(cd14.mono, verbose = FALSE)$RNA)
avg.cd14.mono$gene <- rownames(avg.cd14.mono)
genes.to.label = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1", "IFIT2", "IFIT1", "CXCL10", "CCL8")
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + geom_point() + ggtitle("CD4 Naive T Cells")
p1 <- LabelPoints(plot = p1, points = genes.to.label, repel = TRUE)
p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + geom_point() + ggtitle("CD14 Monocytes")
p2 <- LabelPoints(plot = p2, points = genes.to.label, repel = TRUE)
plot_grid(p1, p2)
许多相同的基因在这两种细胞类型中都被上调,并且可能代表一种保守的干扰素应答途径。
因为我们有信心在整个条件下鉴定出共同的细胞类型,可以询问在相同类型的细胞的不同条件下哪些基因会发生变化。首先在meta.data插槽中创建一个列,以保存单元格类型和刺激信息,并将当前标识切换到该列。然后FindMarkers用来找到刺激和对照B细胞之间不同的基因。请注意,此处显示的许多顶级基因与我们之前作为核心干扰素应答基因绘制的基因相同。此外,看到的像CXCL10这样的基因对单核细胞和B细胞干扰素反应具有特异性,在该列表中也显示出非常重要的基因。
immune.combined$celltype.stim <- paste(Idents(immune.combined), immune.combined$stim, sep = "_")
immune.combined$celltype <- Idents(immune.combined)
Idents(immune.combined) <- "celltype.stim"
b.interferon.response <- FindMarkers(immune.combined, ident.1 = "B_STIM", ident.2 = "B_CTRL", verbose = FALSE)
head(b.interferon.response, n = 15)
另一种可视化基因表达变化的有用方法是split.by选择FeaturePlot或VlnPlot功能。这将显示给定基因列表的FeaturePlots,由分组变量(这里的刺激条件)分开。诸如CD3D和GNLY的基因是规范细胞类型标记(对于T细胞和NK / CD8T细胞),其实际上不受干扰素刺激的影响并且在对照和刺激组中显示相似的基因表达模式。另一方面,IFI6和ISG15是核心干扰素应答基因,并且在所有细胞类型中相应地上调。最后,CD14和CXCL10是显示细胞类型特异性干扰素应答的基因。CD14单核细胞刺激后CD14表达降低,这可能导致监督分析框架中的错误分类,强调综合分析的价值。
FeaturePlot(immune.combined, features = c("CD3D", "GNLY", "IFI6"), split.by = "stim", max.cutoff = 3,
cols = c("grey", "red"))
plots <- VlnPlot(immune.combined, features = c("LYZ", "ISG15", "CXCL10"), split.by = "stim", group.by = "celltype",
pt.size = 0, combine = FALSE)
CombinePlots(plots = plots, ncol = 1)
参考材料:
https://satijalab.org/seurat/v3.0/immune_alignment.html