Seurat standard pipeline
记录一下Seurat标准的单细胞分析流程,这里使用官方提供的pbmc3k作为示例
pbmc3k: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
Seurat单细胞分析流程主要就是以下十句代码
pbmc.counts <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.counts)
pbmc <- NormalizeData(object = pbmc)
pbmc <- FindVariableFeatures(object = pbmc)
pbmc <- ScaleData(object = pbmc)
pbmc <- RunPCA(object = pbmc)
pbmc <- FindNeighbors(object = pbmc)
pbmc <- FindClusters(object = pbmc)
pbmc <- RunTSNE(object = pbmc)
DimPlot(object = pbmc, reduction = "tsne")
以下详细展开某一步的功能
library(dplyr)
library(Seurat)
library(patchwork)
创建Seurat对象
Seurat接受counts文件作为输入(一般经过cellranger处理),创建包含细胞信息和counts信息的对象。
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
Seurat对象中的counts以稀疏矩阵的方式存储以节省内存,.
表示没有检测到counts
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:10]
## 3 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 4 . 10 . . 1 2 3 1 .
## TCL1A . . . . . . . . 1 .
## MS4A1 . 6 . . . . . . 1 1
质控
一般而言,我们需要对数据进行质控以保证数据的质量,在进行后续的分析。常用的质控指标包括:
-
每个细胞的唯一基因数目
低质量或空液泡往往只能检测到少量基因
双液泡(doublet)或多液泡(multiplets)会具有异常多的基因数目
每个细胞的总counts数(相当于每个细胞的测序深度)
-
线粒体基因占比
- 低质量或死细胞会具有异常高的线粒体基因表达
由于每个细胞的基因数和测序深度在cellranger分析的时候已经计算过了,这里我们只需要再计算线粒体基因表达的比例即可
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
Seurat将细胞相关的元数据存储在 pbmc@meta.data
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
随后,我们过滤掉基因数(nFeature_RNA
)大于2500或小于200的细胞,以及线粒体基因组比例大于5%的细胞
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
需要注意的是这里的过滤标准在适合在这个数据集中使用,未必适用于其他的数据集。更好的质控方法是根据质控指标的分位数进行过滤,例如过滤掉 nFeature_RNA
上四分位数和下四分位数的细胞。
另外,这里只使用了三种指标对细胞进行质控,在实际分析中我们还可以使用其他工具进行更精密的质控,例如:
- SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data (https://github.com/constantAmateur/SoupX)
- DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors (https://github.com/chris-mcginnis-ucsf/DoubletFinder)
- DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data (https://github.com/powellgenomicslab/DropletQC)
Normalization
在质控后,我们进行counts的normalization,默认使用 "LogNormalize" 的方法,即将每个基因的counts除以细胞总的counts数,乘上10,000,再进行对数转换。
pbmc <- NormalizeData(pbmc)
Seurat提供了另外的normalization方法,通过
normalization.method
指定, 包括:
"CLR": centered log ratio transformation
"RC": equals to "LogNormalize" without log-transformation
校正后的数据在 pbmc[["RNA"]]@data
pbmc[["RNA"]]@data[c("CD3D", "TCL1A", "MS4A1"), 1:10]
## 3 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 2.864242 . 3.489706 . . 1.726902 2.321937 2.658463 2.179642
## TCL1A . . . . . . . . 2.179642
## MS4A1 . 2.583047 . . . . . . 2.179642
##
## CD3D .
## TCL1A .
## MS4A1 2.309182
特征选择
Seurat选择在细胞细胞之间具有高度变异性的基因(例如某些细胞高表达,而其他细胞不表达)进行后续分析,这是由于这些基因可以代表细胞与细胞间的主要生物学差异
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
默认选择前2000个高度变异基因。
数据缩放
Normalization后,需要对数据进行缩放(scaling)。 Scaling后,数据的均值为0,方差为1。
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
scaled data存放在 pbmc[["RNA"]]@scale.data
pbmc[["RNA"]]@scale.data[c("CD3D", "TCL1A", "MS4A1"), 1:10]
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## CD3D 1.2509633 -0.9797929 1.7380926 -0.9797929
## TCL1A -0.3187677 -0.3187677 -0.3187677 -0.3187677
## MS4A1 -0.4110536 2.5965712 -0.4110536 -0.4110536
## AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
## CD3D -0.9797929 0.3651696 0.8286000 1.0906967
## TCL1A -0.3187677 -0.3187677 -0.3187677 -0.3187677
## MS4A1 -0.4110536 -0.4110536 -0.4110536 -0.4110536
## AAACGCTGTAGCCA-1 AAACGCTGTTTCTG-1
## CD3D 0.7177763 -0.9797929
## TCL1A 2.3330706 -0.3187677
## MS4A1 2.1268583 2.2776908
线性降维
Seurat 使用PCA进行降维,这里只对 FindVariableFeatures
挑选出的高变基因进行PCA分析
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1:5, cells = 500, balanced = TRUE)
维数选择
Seurat在主成分PC上进行聚类。然而直接对所有PC聚类是不现实的,我们需要选择足够的PC以代表数据的主要变异度,同时控制计算资源的开销。
因此,Seurat结合JackStraw程序和置换检验对PC进行显著性分析,鉴定出显著的PC以进行后续分析。
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
PC11之后,PC的p-value就发生了迅速的上升,而变得不显著。
JackStrawPlot(pbmc, dims = 1:15)
## Warning: Removed 23496 rows containing missing values (geom_point).
我们还可以结合elbow plot进行判断,选择拐点和曲线平滑的PC
ElbowPlot(pbmc)
综上,我们选取前10个维度进行后续分析
细胞聚类
Seurat使用基于图的聚类算法对细胞进行聚类
FindNeighbors
中的 dims
参数指定聚类使用的维度
FindClusters
中的 resolution
参数指定类别的精度,越大则分出越多的类;越小则类别越少
pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
table(Idents(pbmc))
##
## 0 1 2 3 4 5 6 7 8
## 711 480 472 344 279 162 144 32 14
非线性降维(UMAP/tSNE)
非线性降维捕捉数据内部的流式(manifold)以将细胞投射到低维空间中。
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 17:08:35 UMAP embedding parameters a = 0.9922 b = 1.112
## 17:08:35 Read 2638 rows and found 10 numeric columns
## 17:08:35 Using Annoy for neighbor search, n_neighbors = 30
## 17:08:35 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 17:08:35 Writing NN index file to temp file C:\Users\lda\AppData\Local\Temp\RtmpMlm7gA\file678867c8bcc
## 17:08:35 Searching Annoy index using 1 thread, search_k = 3000
## 17:08:36 Annoy recall = 100%
## 17:08:36 Commencing smooth kNN distance calibration using 1 thread
## 17:08:37 Initializing from normalized Laplacian + noise
## 17:08:37 Commencing optimization for 500 epochs, with 105124 positive edges
## 17:08:43 Optimization finished
pbmc <- RunTSNE(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap", label = TRUE)
DimPlot(pbmc, reduction = "tsne", label = TRUE)
UMAP和tSNE的降维效果不同,需要根据实际情况选择。
在这里可以保存中间数据,作为一个checkpoint
saveRDS(pbmc, file = "data/pbmc_tutorial.rds")
鉴定差异表达特征(cluster markers)
Seurat支持对cluster之间进行差异表达分析,主要有 FindMarkers
和 FindAllMarkers
两种方法。
这里鉴定cluster 5和cluster 0, 3之间的差异基因。如果不指定 ident.2
则鉴定cluster 5 与其余clusters的差异基因。
min.pct
指定差异基因需要在cluster中的表达占比
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
## IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
## CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
## CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
## RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
FindAllMarkers
可以一次寻找所有clusters的markers,但只返回上调的markers
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 2 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 3 0. 5.57 0.996 0.215 0. 1 S100A9
## 4 0. 5.48 0.975 0.121 0. 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0. 4.31 0.936 0.041 0. 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 10 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 14 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
## 18 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
Visualization
Seurat提供多种基因表达量可视化方法
- 小提琴图
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
- 细胞降维图
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
- 热图
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
细胞注释
我们可以根据细胞marker基因的表达对细胞进行注释。虽然目前有一些自动注释的工具,但总的来说大家还是根据细胞的经典markers对细胞进行注释。
这里,我们根据教程中提供的cluster markers和细胞类型进行注释
Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 CD14, LYZ CD14+ Mono
2 IL7R, S100A4 Memory CD4+
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "data/pbmc3k_final.rds")
至此,Seurat分析的常规流程就结束了。
Ref:
Seurat - Guided Clustering Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html