为了进一步分析,我们需要对数据进行标准化,以消除测序深度的影响。传统方法是将其缩放到10,000,然后对获得的值进行log2转换。标准化数据存储在“RNA” assay的srat[['RNA']]@data中。
> srat <- NormalizeData(srat)
下一步发现高变基因——这些通常是下游分析中最感兴趣的。
> srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
确定10个变异最大的基因:
> top10 <- head(VariableFeatures(srat), 10)
> top10
[1] "PTGDS" "IGLC3" "PPBP" "CXCL10" "GZMB" "GP1BB" "JCHAIN" "FCER1A"
[9] "IGKC" "MZB1"
绘制可变基因:
> plot1 <- VariableFeaturePlot(srat)
> LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)

ScaleData将标准化的基因表达量转换为Z-score(以0为中心且方差为1的值),它存储在srat[['RNA']]@scale.data中并用于后续PCA。默认仅对可变基因进行缩放。
> all.genes <- rownames(srat)
> srat <- ScaleData(srat, features = all.genes)
Centering and scaling data matrix
现在可以进行PCA,这是一种常见的线性降维方法。默认情况下,我们使用2000个高可变基因。
> srat <- RunPCA(srat, features = VariableFeatures(object = srat))
PC_ 1
Positive: FCN1, CST3, LYZ, FGL2, IFI30, MNDA, CTSS, TYMP, TYROBP, SERPINA1
TMEM176B, CYBB, S100A9, PSAP, NCF2, LST1, SPI1, AIF1, IGSF6, CD68
CSTA, GRN, TNFSF13B, CFD, S100A8, MPEG1, MS4A6A, FCER1G, CFP, DUSP6
Negative: LTB, IL32, TRAC, TRBC2, IL7R, CD7, LIME1, ARL4C, CD27, PRKCQ-AS1
CCR7, FCMR, CD247, LEF1, CD69, TRBC1, GZMM, MAL, TRABD2A, BCL2
SYNE2, ISG20, CTSW, IKZF3, ITM2A, RORA, AQP3, TRAT1, CD8B, KLRK1
PC_ 2
Positive: IGHM, MS4A1, CD79A, BANK1, SPIB, NIBAN3, BCL11A, IGKC, CD79B, LINC00926
TCF4, RALGPS2, AFF3, TNFRSF13C, IGHD, HLA-DQA1, TSPAN13, BLNK, CD22, PAX5
BLK, COBLL1, VPREB3, FCER2, JCHAIN, FCRLA, GNG7, HLA-DOB, TCL1A, LINC02397
Negative: IL32, TRAC, CD7, IL7R, CD247, GZMM, ANXA1, CTSW, S100A4, PRKCQ-AS1
TRBC1, S100A10, ITGB2, LEF1, KLRK1, RORA, GZMA, S100A6, NEAT1, MAL
CST7, NKG7, ID2, TRBC2, ARL4C, MT2A, SAMD3, PRF1, LIME1, TRAT1
PC_ 3
Positive: GZMB, CLIC3, C12orf75, LILRA4, NKG7, CLEC4C, SERPINF1, CST7, GZMA, GNLY
SCT, LRRC26, DNASE1L3, PRF1, TPM2, KLRD1, FGFBP2, CTSC, PACSIN1, IL3RA
HOPX, CCL5, PLD4, LINC00996, GZMH, CCL4, MAP1A, SMPD3, TNFRSF21, PTCRA
Negative: MS4A1, CD79A, LINC00926, BANK1, TNFRSF13C, IGHD, PAX5, RALGPS2, VPREB3, CD22
FCER2, CD79B, HLA-DOB, FCRL1, P2RX5, CD24, ARHGAP24, ADAM28, CCR7, SWAP70
FCRLA, LINC02397, CD19, IGHM, CD40, PKIG, FCRL2, BASP1, POU2AF1, BLK
PC_ 4
Positive: NKG7, GNLY, CST7, PRF1, KLRD1, GZMA, FGFBP2, HOPX, CCL4, FCGR3A
KLRF1, GZMH, CCL5, SPON2, CD160, ADGRG1, PTGDR, LAIR2, TRDC, RHOC
IFITM2, ABI3, MATK, TBX21, IL2RB, XCL2, PRSS23, FCRL6, CTSW, S1PR5
Negative: LILRA4, SERPINF1, CLEC4C, SCT, LRRC26, DNASE1L3, TPM2, MAP1A, TNFRSF21, PACSIN1
LINC00996, SCN9A, PTCRA, EPHB1, ITM2C, SMIM5, LAMP5, DERL3, CIB2, APP
IL3RA, SMPD3, PLEKHD1, SCAMP5, PLD4, ZFAT, PPM1J, GAS6, LGMN, TLR9
PC_ 5
Positive: S100A12, VCAN, ITGAM, CES1, S100A8, CYP1B1, PADI4, MGST1, MEGF9, MCEMP1
QPCT, GNLY, CD14, RNASE2, CSF3R, RBP7, NKG7, KLRD1, VNN2, CLEC4E
CRISPLD2, THBS1, PRF1, CST7, CKAP4, BST1, CTSD, CR1, FGFBP2, PGD
Negative: CDKN1C, HES4, CTSL, BATF3, TCF7L2, SIGLEC10, CSF1R, CKB, MS4A7, CALML4
FCGR3A, CASP5, RRAS, AC064805.1, MS4A4A, NEURL1, AC104809.2, IFITM3, MTSS1, SMIM25
CAMK1, GPBAR1, ABI3, HMOX1, ZNF703, FAM110A, RHOC, CXCL16, CALHM6, RNASET2
对于表现良好的数据集,主成分“载荷”应与不同细胞群的marker相匹配。可以使用ggplot2功能更改许多绘图参数并使用&运算符传递它们。
> VizDimLoadings(srat, dims = 1:9, reduction = "pca") &
theme(axis.text=element_text(size=5), axis.title=element_text(size=8,face="bold"))

或者同时绘制每个主成分的热图:
> DimHeatmap(srat, dims = 1:6, nfeatures = 20, cells = 500, balanced = T)

DimPlot用于可视化所有降维表示(PCA、tSNE、UMAP等)。Identity仍设置为“orig.ident”。DimPlot具有内置的降维层次结构,它的绘制优先级为UMAP>tSNE>PCA。
> DimPlot(srat, reduction = "pca")

找出可以使用多少PC又没有太多信息丢失通常是有利的。在我们的例子中,PC10的下降幅度很大,因此这似乎是一个不错的初始选择:
> ElbowPlot(srat)

我们现在可以进行聚类。高分辨率会产生更多的类(默认值为0.8)。找到正确的聚类分辨率非常重要,因为细胞类型marker取决于定义的cluster。
> srat <- FindNeighbors(srat, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> srat <- FindClusters(srat, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 10194
Number of edges: 337796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9082
Number of communities: 13
Elapsed time: 1 seconds
为了进行可视化,我们还需要生成UMAP降维:
> srat <- RunUMAP(srat, dims = 1:10, verbose = F)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
聚类完成后,活动状态的identity将重置为“seurat_clusters”。让我们看看聚类情况。
> table(srat@meta.data$seurat_clusters)
0 1 2 3 4 5 6 7 8 9 10 11 12
2854 1671 1163 1079 851 774 426 402 381 245 178 134 36
DimPlot默认使用UMAP,以Seurat聚类作为identity:
> DimPlot(srat,label.size = 4,repel = T,label = T)

为了控制聚类分辨率和其他可能的影响因素,我们将仔细观察两个小细胞群:1)树突状细胞(DC),2)血小板。让我们看一下每种细胞类型的两个marker:DC的LILRA4和TPM2,以及血小板的PPBP和GP1BB。
> FeaturePlot(srat, features = c("LILRA4", "TPM2", "PPBP", "GP1BB"))

想象一下其他的混杂因素:
> FeaturePlot(srat, features = "Doublet_score") & theme(plot.title = element_text(size=10))

> FeaturePlot(srat, features = "percent.mt") & theme(plot.title = element_text(size=10))

> FeaturePlot(srat, features = "nFeature_RNA") & theme(plot.title = element_text(size=10))

让我们删除未通过QC的细胞并比较两个图。现在我们可以看到更明确的cluster。过滤后的数据集包含8,824个细胞,大约12%的细胞由于各种原因被删除。
> DimPlot(srat,label.size = 4,repel = T,label = T)

> srat <- subset(srat, subset = QC == 'Pass')
> DimPlot(srat,label.size = 4,repel = T,label = T)

最后,我们来计算细胞周期分数。这必须在标准化和中心化后完成。Seurat有一个内置列表,cc.genes(旧版)和 cc.genes.updated.2019(新版),用于定义与细胞周期有关的基因。对于CellRanger参考基因组GRCh38 v2.0.0及以上版本,请使用cc.genes.updated.2019(三个基因被重命名:MLF1IP、FAM64A和HN1变为CENPU、PICALM和JPT)。对于小鼠细胞周期基因,可以使用https://github.com/satijalab/seurat/issues/2493详述的解决方案。
> cc.genes.updated.2019
$s.genes
[1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM7" "MCM4"
[7] "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL"
[13] "PRIM1" "UHRF1" "CENPU" "HELLS" "RFC2" "POLR1B"
[19] "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2"
[25] "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
[31] "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM"
[37] "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "MRPL36"
[43] "E2F8"
$g2m.genes
[1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A"
[8] "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF"
[15] "TACC3" "PIMREG" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB"
[22] "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP"
[29] "CDCA3" "JPT1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1"
[36] "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
[43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF"
[50] "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
> s.genes <- cc.genes.updated.2019$s.genes
> g2m.genes <- cc.genes.updated.2019$g2m.genes
> srat <- CellCycleScoring(srat, s.features = s.genes, g2m.features = g2m.genes)
> table(srat[[]]$Phase)
G1 G2M S
5431 977 2416
让我们看看是否有因技术差异产生的cluster。线粒体基因表现出与cluster的相关性,在cluster 12中要低得多。
> FeaturePlot(srat,features = "percent.mt",label.size = 4,repel = T,label = T) &
theme(plot.title = element_text(size=10))

> VlnPlot(srat,features = "percent.mt") & theme(plot.title = element_text(size=10))

核糖体蛋白基因对假定的细胞类型表现出非常强的相关性,一些cluster有高达45%的相关性,而有些则只有15%。
> FeaturePlot(srat,features = "percent.rb",label.size = 4,repel = T,label = T) &
theme(plot.title = element_text(size=10))

> VlnPlot(srat,features = "percent.rb") & theme(plot.title = element_text(size=10))

每种细胞类型的RNA含量也存在差异。Cluster 6和7可能是质量较低的细胞,当我们使用QC过滤的数据重新进行聚类时,它们会消失。
> VlnPlot(srat,features = c("nCount_RNA","nFeature_RNA")) &
theme(plot.title = element_text(size=10))

细胞周期评分似乎并不太依赖于细胞类型,但每个cluster都有显著的异常值。
> FeaturePlot(srat,features = c("S.Score","G2M.Score"),label.size = 4,repel = T,label = T) &
theme(plot.title = element_text(size=10))

> VlnPlot(srat,features = c("S.Score","G2M.Score")) &
theme(plot.title = element_text(size=10))
