Section 3: Multi-Sample Data Integration
In this tutorial section, we explore different approaches to integrating multiple scRNA-seq datasets. We use two different methods to correct for batch effects across datasets, and we also present a quantitative measure to assess how well the datasets are integrated. Seurat uses the data-integration method introduced in Comprehensive Integration of Single-Cell Data, whereas Scran and Scanpy use the mutual nearest neighbors (MNN) approach. Commonly used methods for multi-sample integration are listed below:
Method | Language | Library | Ref |
---|---|---|---|
CCA | R | Seurat | Cell |
MNN | R/Python | Scater/Scanpy | Nat. Biotech. |
Conos | R | conos | Nat. Methods |
Scanorama | Python | scanorama | Nat. Biotech. |
Load the required R packages and the post-QC, dimension-reduced data
# Load the required R packages
suppressPackageStartupMessages({
library(Seurat)
library(cowplot)
library(ggplot2)
})
# Load the dataset after QC and dimensionality reduction
alldata <- readRDS("data/results/covid_qc_dr.rds")
alldata
#An object of class Seurat
#18121 features across 5532 samples within 1 assay
#Active assay: RNA (18121 features, 2000 variable features)
# 6 dimensional reductions calculated: pca, umap, tsne, UMAP10_on_PCA, UMAP_on_ScaleData, UMAP_on_Graph
print(names(alldata@reductions))
## [1] "pca" "umap" "tsne"
## [4] "UMAP10_on_PCA" "UMAP_on_ScaleData" "UMAP_on_Graph"
Split the data into per-sample datasets
Next, we split the merged object by sample into separate datasets, then log-normalize each sample individually and select its highly variable genes.
# Split the Seurat object with the SplitObject function
alldata.list <- SplitObject(alldata, split.by = "orig.ident")
# Normalize each sample separately and select its highly variable genes
for (i in 1:length(alldata.list)) {
alldata.list[[i]] <- NormalizeData(alldata.list[[i]], verbose = FALSE)
alldata.list[[i]] <- FindVariableFeatures(alldata.list[[i]], selection.method = "vst",
nfeatures = 2000, verbose = FALSE)
}
# Extract the highly variable genes selected in each dataset
hvgs_per_dataset <- lapply(alldata.list, function(x) {
x@assays$RNA@var.features
})
head(hvgs_per_dataset)
# Draw a Venn diagram of the overlap of highly variable genes across samples
venn::venn(hvgs_per_dataset, opacity = 0.4, zcolor = (scales::hue_pal())(3), cexsn = 1,
cexil = 1, lwd = 1, col = "white", frame = F, borders = NA)
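As a numeric complement to the Venn diagram, we can also count the overlap directly. A minimal sketch (not part of the original tutorial), using base R on the hvgs_per_dataset list computed above:
# Variable genes selected in every sample
shared_hvgs <- Reduce(intersect, hvgs_per_dataset)
length(shared_hvgs)
# Variable genes selected in at least one sample
length(unique(unlist(hvgs_per_dataset)))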
Identify anchors for data integration
We use the FindIntegrationAnchors function to identify the anchors used for multi-sample data integration.
alldata.anchors <- FindIntegrationAnchors(object.list = alldata.list, dims = 1:30,
reduction = "cca")
# Computing 2000 integration features
# Scaling features for provided objects
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
# Finding all pairwise anchors
# Running CCA
# Merging objects
# Finding neighborhoods
# Finding anchors
# Found 2010 anchors
# Filtering anchors
# Retained 1720 anchors
# Extracting within-dataset neighbors
# ... (the same CCA/anchor-finding steps repeat for each remaining pair of datasets)
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=36s
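Before running the integration itself, it can be helpful to take a quick look at the anchor set. A minimal sketch, assuming a Seurat v3 AnchorSet object (slot names may differ in other Seurat versions):
# List the slots of the AnchorSet object
slotNames(alldata.anchors)
# The anchors slot holds the anchor cell pairs and their scores
head(alldata.anchors@anchors)
nrow(alldata.anchors@anchors)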
Integrate the multi-sample datasets
Next, we use the IntegrateData function to integrate the datasets based on the identified anchors.
alldata.int <- IntegrateData(anchorset = alldata.anchors, dims = 1:30, new.assay.name = "CCA")
Merging dataset 1 into 3
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
Integrating data
... (the same steps repeat for each subsequent merge)
Merging dataset 6 into 5
Merging dataset 2 into 3 1
Merging dataset 5 6 into 3 1 2
Merging dataset 4 into 3 1 2 5 6
Warning: Adding a command log without an assay associated with it
We can observe that a new assay slot is now created under the name CCA.
names(alldata.int@assays)
## [1] "RNA" "CCA"
# By default, Seurat now sets the integrated assay as the default assay, so any
# operation you now perform will be on the integrated data.
alldata.int@active.assay
## [1] "CCA"
After running IntegrateData, a new integrated (or 'batch-corrected') assay is created in the Seurat object. The original, uncorrected data remain in the "RNA" assay, and we can use the integrated assay for downstream dimensionality-reduction and visualization analysis.
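Keep in mind that the corrected values in the integrated assay are intended for dimensionality reduction and clustering; for differential expression or marker-gene plots it is generally recommended to switch back to the normalized values in the RNA assay. A minimal sketch of switching assays:
# Use the uncorrected, log-normalized values for e.g. FindMarkers()
DefaultAssay(alldata.int) <- "RNA"
# Switch back to the integrated values for downstream dimensionality reduction
DefaultAssay(alldata.int) <- "CCA"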
Dimensionality reduction and visualization
Next, we scale the integrated data and run PCA and UMAP/t-SNE to visualize it.
# Run Dimensionality reduction on integrated space
alldata.int <- ScaleData(alldata.int, verbose = FALSE)
alldata.int <- RunPCA(alldata.int, npcs = 30, verbose = FALSE)
alldata.int <- RunUMAP(alldata.int, dims = 1:30)
alldata.int <- RunTSNE(alldata.int, dims = 1:30)
# Visualize the data before and after batch-corrected integration
plot_grid(ncol = 3,
DimPlot(alldata, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA raw_data"),
DimPlot(alldata, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE raw_data"),
DimPlot(alldata, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP raw_data"),
DimPlot(alldata.int, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA integrated"),
DimPlot(alldata.int, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE integrated"),
DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP integrated")
)
As shown in the figure, after batch correction the integrated cells mix well by cell type, with no obvious sample-to-sample batch separation.
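As a quantitative complement to the plots, one simple way to measure mixing (a minimal sketch, not from the original tutorial, assuming default graph-based clustering parameters) is to cluster the integrated data and compute the entropy of the sample composition within each cluster; well-mixed clusters give higher entropy:
# Cluster the integrated data on its PCA space
alldata.int <- FindNeighbors(alldata.int, reduction = "pca", dims = 1:30)
alldata.int <- FindClusters(alldata.int, resolution = 0.5)
# Entropy of the sample (orig.ident) composition within each cluster
tab <- table(alldata.int$seurat_clusters, alldata.int$orig.ident)
props <- prop.table(tab, margin = 1)
round(apply(props, 1, function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
}), 2)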
Visualize cell-type-specific marker genes
Markers | Cell Type |
---|---|
CD3E | T cells |
CD3E CD4 | CD4+ T cells |
CD3E CD8A | CD8+ T cells |
GNLY, NKG7 | NK cells |
MS4A1 | B cells |
CD14, LYZ, CST3, MS4A7 | CD14+ Monocytes |
FCGR3A, LYZ, CST3, MS4A7 | FCGR3A+ Monocytes |
FCER1A, CST3 | DCs |
FeaturePlot(alldata.int, reduction = "umap", features = c("CD3E", "CD4", "CD8A",
"NKG7", "GNLY", "MS4A1", "CD14", "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A"),
order = T, slot = "data", combine = T)
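A DotPlot of the same markers, drawn from the uncorrected RNA assay, gives a compact per-sample summary alongside the feature plots. A minimal sketch (grouping by sample, since clusters have not been annotated at this point):
DotPlot(alldata.int, assay = "RNA", group.by = "orig.ident",
    features = c("CD3E", "CD4", "CD8A", "NKG7", "GNLY", "MS4A1", "CD14",
        "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A")) + RotatedAxis()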
Batch correction and multi-sample integration with Harmony
# Load the harmony package
library(harmony)
## Loading required package: Rcpp
# Run batch correction with the RunHarmony function
alldata.harmony <- RunHarmony(alldata, group.by.vars = "orig.ident", reduction = "pca",
dims.use = 1:50, assay.use = "RNA")
Harmony 1/10
Harmony 2/10
Harmony 3/10
Harmony 4/10
Harmony 5/10
Harmony 6/10
Harmony converged after 6 iterations
Warning: Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity
# Here we use all PCs computed from Harmony for UMAP calculation
alldata.int[["harmony"]] <- alldata.harmony[["harmony"]]
# UMAP for visualization
alldata.int <- RunUMAP(alldata.int, dims = 1:50, reduction = "harmony", reduction.name = "umap_harmony")
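The harmony embedding can also be used for downstream graph-based clustering in the same way as a PCA reduction. A minimal sketch:
# Build the neighbor graph on the batch-corrected harmony embedding
alldata.int <- FindNeighbors(alldata.int, reduction = "harmony", dims = 1:50)
alldata.int <- FindClusters(alldata.int, resolution = 0.5)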
Batch correction and multi-sample integration with Scanorama
# Load the required R packages
library(reticulate)
scanorama <- import("scanorama")
hvgs <- unique(unlist(hvgs_per_dataset))
head(hvgs)
# [1] "PTGDS" "IGLC3" "IGLC2" "HLA-DQA1" "IGHM" "CD79A"
assaylist <- list()
genelist <- list()
for (i in 1:length(alldata.list)) {
assaylist[[i]] <- t(as.matrix(GetAssayData(alldata.list[[i]], "data")[hvgs, ]))
genelist[[i]] <- hvgs
}
lapply(assaylist, dim)
## [[1]]
## [1] 540 5203
##
## [[2]]
## [1] 837 5203
##
## [[3]]
## [1] 976 5203
##
## [[4]]
## [1] 1022 5203
##
## [[5]]
## [1] 1129 5203
##
## [[6]]
## [1] 1028 5203
integrated.data <- scanorama$integrate(datasets_full = assaylist, genes_list = genelist)
intdimred <- do.call(rbind, integrated.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)
rownames(intdimred) <- colnames(alldata.int)
# Add standard deviations in order to draw Elbow Plots in Seurat
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)
alldata.int[["scanorama"]] <- CreateDimReducObject(embeddings = intdimred, stdev = stdevs,
key = "PC_", assay = "RNA")
## Warning: Cannot add objects with duplicate keys (offending key: PC_), setting
## key to 'scanorama_'
# Here we use all PCs computed from Scanorama for UMAP calculation
alldata.int <- RunUMAP(alldata.int, dims = 1:100, reduction = "scanorama", reduction.name = "umap_scanorama")
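Since standard deviations were attached to the scanorama reduction above, an elbow plot can show how the variance is spread across the scanorama dimensions. A minimal sketch:
ElbowPlot(alldata.int, reduction = "scanorama", ndims = 100)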
Visualize and compare the results of the different integration methods
p1 <- DimPlot(alldata, reduction = "umap", group.by = "orig.ident") + ggtitle("UMAP raw_data")
p2 <- DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident") + ggtitle("UMAP CCA")
p3 <- DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident") +
ggtitle("UMAP Harmony")
p4 <- DimPlot(alldata.int, reduction = "umap_scanorama", group.by = "orig.ident") +
ggtitle("UMAP Scanorama")
leg <- get_legend(p1)
gridExtra::grid.arrange(gridExtra::arrangeGrob(p1 + NoLegend() + NoAxes(), p2 + NoLegend() +
NoAxes(), p3 + NoLegend() + NoAxes(), p4 + NoLegend() + NoAxes(), nrow = 2),
leg, ncol = 2, widths = c(8, 2))
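Faceting a UMAP by sample is another quick visual check of mixing, since every sample's cells are drawn on the same coordinates. A minimal sketch for the harmony embedding:
DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident",
    split.by = "orig.ident", ncol = 3) + NoLegend()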
Save the integrated results
saveRDS(alldata.int, "data/results/covid_qc_dr_int.rds")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.5
##
## Matrix products: default
## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] reticulate_1.18 harmony_1.0
## [3] Rcpp_1.0.6 scran_1.18.0
## [5] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [7] Biobase_2.50.0 GenomicRanges_1.42.0
## [9] GenomeInfoDb_1.26.0 IRanges_2.24.0
## [11] S4Vectors_0.28.0 BiocGenerics_0.36.0
## [13] MatrixGenerics_1.2.0 matrixStats_0.57.0
## [15] ggplot2_3.3.3 cowplot_1.1.1
## [17] KernSmooth_2.23-18 fields_11.6
## [19] spam_2.6-0 dotCall64_1.0-0
## [21] DoubletFinder_2.0.3 Matrix_1.3-2
## [23] Seurat_3.2.3 RJSONIO_1.3-1.4
## [25] optparse_1.6.6
##
## loaded via a namespace (and not attached):
## [1] plyr_1.8.6 igraph_1.2.6
## [3] lazyeval_0.2.2 splines_4.0.3
## [5] BiocParallel_1.24.0 listenv_0.8.0
## [7] scattermore_0.7 digest_0.6.27
## [9] htmltools_0.5.1 magrittr_2.0.1
## [11] tensor_1.5 cluster_2.1.0
## [13] ROCR_1.0-11 limma_3.46.0
## [15] remotes_2.2.0 globals_0.14.0
## [17] colorspace_2.0-0 ggrepel_0.9.1
## [19] xfun_0.20 dplyr_1.0.3
## [21] crayon_1.3.4 RCurl_1.98-1.2
## [23] jsonlite_1.7.2 spatstat_1.64-1
## [25] spatstat.data_1.7-0 survival_3.2-7
## [27] zoo_1.8-8 glue_1.4.2
## [29] polyclip_1.10-0 gtable_0.3.0
## [31] zlibbioc_1.36.0 XVector_0.30.0
## [33] leiden_0.3.6 DelayedArray_0.16.0
## [35] BiocSingular_1.6.0 future.apply_1.7.0
## [37] maps_3.3.0 abind_1.4-5
## [39] scales_1.1.1 edgeR_3.32.0
## [41] DBI_1.1.1 miniUI_0.1.1.1
## [43] viridisLite_0.3.0 xtable_1.8-4
## [45] dqrng_0.2.1 bit_4.0.4
## [47] rsvd_1.0.3 htmlwidgets_1.5.3
## [49] httr_1.4.2 getopt_1.20.3
## [51] RColorBrewer_1.1-2 ellipsis_0.3.1
## [53] ica_1.0-2 scuttle_1.0.0
## [55] pkgconfig_2.0.3 farver_2.0.3
## [57] uwot_0.1.10 deldir_0.2-9
## [59] locfit_1.5-9.4 tidyselect_1.1.0
## [61] labeling_0.4.2 rlang_0.4.10
## [63] reshape2_1.4.4 later_1.1.0.1
## [65] munsell_0.5.0 tools_4.0.3
## [67] generics_0.1.0 ggridges_0.5.3
## [69] evaluate_0.14 stringr_1.4.0
## [71] fastmap_1.0.1 yaml_2.2.1
## [73] goftest_1.2-2 knitr_1.30
## [75] bit64_4.0.5 fitdistrplus_1.1-3
## [77] admisc_0.11 purrr_0.3.4
## [79] RANN_2.6.1 sparseMatrixStats_1.2.0
## [81] pbapply_1.4-3 future_1.21.0
## [83] nlme_3.1-151 mime_0.9
## [85] formatR_1.7 venn_1.9
## [87] hdf5r_1.3.3 compiler_4.0.3
## [89] plotly_4.9.3 curl_4.3
## [91] png_0.1-7 spatstat.utils_1.20-2
## [93] statmod_1.4.35 tibble_3.0.5
## [95] stringi_1.5.3 RSpectra_0.16-0
## [97] bluster_1.0.0 lattice_0.20-41
## [99] vctrs_0.3.6 pillar_1.4.7
## [101] lifecycle_0.2.0 lmtest_0.9-38
## [103] BiocNeighbors_1.8.0 RcppAnnoy_0.0.18
## [105] data.table_1.13.6 bitops_1.0-6
## [107] irlba_2.3.3 httpuv_1.5.5
## [109] patchwork_1.1.1 R6_2.5.0
## [111] promises_1.1.1 gridExtra_2.3
## [113] parallelly_1.23.0 codetools_0.2-18
## [115] MASS_7.3-53 assertthat_0.2.1
## [117] withr_2.4.0 sctransform_0.3.2
## [119] GenomeInfoDbData_1.2.4 mgcv_1.8-33
## [121] beachmat_2.6.0 rpart_4.1-15
## [123] tidyr_1.1.2 DelayedMatrixStats_1.12.0
## [125] rmarkdown_2.6 Rtsne_0.15
## [127] shiny_1.5.0
Source: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_03_integration.html