重生之我在剑桥大学学习单细胞RNA-seq分析——7. 使用Seurat进行单细胞RNA测序分析(2)

为了进一步分析,我们需要对数据进行标准化,以消除测序深度的影响。传统方法是将其缩放到10,000,然后对获得的值进行log2转换。标准化数据存储在“RNA” assay的srat[['RNA']]@data中。

> srat <- NormalizeData(srat)

下一步发现高变基因——这些通常是下游分析中最感兴趣的。

> srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)

确定10个变异最大的基因:

> top10 <- head(VariableFeatures(srat), 10)
> top10
 [1] "PTGDS"  "IGLC3"  "PPBP"   "CXCL10" "GZMB"   "GP1BB"  "JCHAIN" "FCER1A"
 [9] "IGKC"   "MZB1"

绘制可变基因:

> plot1 <- VariableFeaturePlot(srat)
> LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)

ScaleData将标准化的基因表达量转换为Z-score(以0为中心且方差为1的值),它存储在srat[['RNA']]@scale.data中并用于后续PCA。默认仅对可变基因进行缩放。

> all.genes <- rownames(srat)
> srat <- ScaleData(srat, features = all.genes)
Centering and scaling data matrix

现在可以进行PCA,这是一种常见的线性降维方法。默认情况下,我们使用2000个高可变基因。

> srat <- RunPCA(srat, features = VariableFeatures(object = srat))
PC_ 1 
Positive:  FCN1, CST3, LYZ, FGL2, IFI30, MNDA, CTSS, TYMP, TYROBP, SERPINA1 
           TMEM176B, CYBB, S100A9, PSAP, NCF2, LST1, SPI1, AIF1, IGSF6, CD68 
           CSTA, GRN, TNFSF13B, CFD, S100A8, MPEG1, MS4A6A, FCER1G, CFP, DUSP6 
Negative:  LTB, IL32, TRAC, TRBC2, IL7R, CD7, LIME1, ARL4C, CD27, PRKCQ-AS1 
           CCR7, FCMR, CD247, LEF1, CD69, TRBC1, GZMM, MAL, TRABD2A, BCL2 
           SYNE2, ISG20, CTSW, IKZF3, ITM2A, RORA, AQP3, TRAT1, CD8B, KLRK1 
PC_ 2 
Positive:  IGHM, MS4A1, CD79A, BANK1, SPIB, NIBAN3, BCL11A, IGKC, CD79B, LINC00926 
           TCF4, RALGPS2, AFF3, TNFRSF13C, IGHD, HLA-DQA1, TSPAN13, BLNK, CD22, PAX5 
           BLK, COBLL1, VPREB3, FCER2, JCHAIN, FCRLA, GNG7, HLA-DOB, TCL1A, LINC02397 
Negative:  IL32, TRAC, CD7, IL7R, CD247, GZMM, ANXA1, CTSW, S100A4, PRKCQ-AS1 
           TRBC1, S100A10, ITGB2, LEF1, KLRK1, RORA, GZMA, S100A6, NEAT1, MAL 
           CST7, NKG7, ID2, TRBC2, ARL4C, MT2A, SAMD3, PRF1, LIME1, TRAT1 
PC_ 3 
Positive:  GZMB, CLIC3, C12orf75, LILRA4, NKG7, CLEC4C, SERPINF1, CST7, GZMA, GNLY 
           SCT, LRRC26, DNASE1L3, PRF1, TPM2, KLRD1, FGFBP2, CTSC, PACSIN1, IL3RA 
           HOPX, CCL5, PLD4, LINC00996, GZMH, CCL4, MAP1A, SMPD3, TNFRSF21, PTCRA 
Negative:  MS4A1, CD79A, LINC00926, BANK1, TNFRSF13C, IGHD, PAX5, RALGPS2, VPREB3, CD22 
           FCER2, CD79B, HLA-DOB, FCRL1, P2RX5, CD24, ARHGAP24, ADAM28, CCR7, SWAP70 
           FCRLA, LINC02397, CD19, IGHM, CD40, PKIG, FCRL2, BASP1, POU2AF1, BLK 
PC_ 4 
Positive:  NKG7, GNLY, CST7, PRF1, KLRD1, GZMA, FGFBP2, HOPX, CCL4, FCGR3A 
           KLRF1, GZMH, CCL5, SPON2, CD160, ADGRG1, PTGDR, LAIR2, TRDC, RHOC 
           IFITM2, ABI3, MATK, TBX21, IL2RB, XCL2, PRSS23, FCRL6, CTSW, S1PR5 
Negative:  LILRA4, SERPINF1, CLEC4C, SCT, LRRC26, DNASE1L3, TPM2, MAP1A, TNFRSF21, PACSIN1 
           LINC00996, SCN9A, PTCRA, EPHB1, ITM2C, SMIM5, LAMP5, DERL3, CIB2, APP 
           IL3RA, SMPD3, PLEKHD1, SCAMP5, PLD4, ZFAT, PPM1J, GAS6, LGMN, TLR9 
PC_ 5 
Positive:  S100A12, VCAN, ITGAM, CES1, S100A8, CYP1B1, PADI4, MGST1, MEGF9, MCEMP1 
           QPCT, GNLY, CD14, RNASE2, CSF3R, RBP7, NKG7, KLRD1, VNN2, CLEC4E 
           CRISPLD2, THBS1, PRF1, CST7, CKAP4, BST1, CTSD, CR1, FGFBP2, PGD 
Negative:  CDKN1C, HES4, CTSL, BATF3, TCF7L2, SIGLEC10, CSF1R, CKB, MS4A7, CALML4 
           FCGR3A, CASP5, RRAS, AC064805.1, MS4A4A, NEURL1, AC104809.2, IFITM3, MTSS1, SMIM25 
           CAMK1, GPBAR1, ABI3, HMOX1, ZNF703, FAM110A, RHOC, CXCL16, CALHM6, RNASET2

对于表现良好的数据集,主成分“载荷”应与不同细胞群的marker相匹配。可以使用ggplot2功能更改许多绘图参数并使用&运算符传递它们。

> VizDimLoadings(srat, dims = 1:9, reduction = "pca") & 
    theme(axis.text=element_text(size=5), axis.title=element_text(size=8,face="bold"))

或者同时绘制每个主成分的热图:

> DimHeatmap(srat, dims = 1:6, nfeatures = 20, cells = 500, balanced = T)

DimPlot用于可视化所有降维表示(PCA、tSNE、UMAP等)。Identity仍设置为“orig.ident”。DimPlot具有内置的降维层次结构,它的绘制优先级为UMAP>tSNE>PCA。

> DimPlot(srat, reduction = "pca")

找出可以使用多少PC又没有太多信息丢失通常是有利的。在我们的例子中,PC10的下降幅度很大,因此这似乎是一个不错的初始选择:

> ElbowPlot(srat) 

我们现在可以进行聚类。高分辨率会产生更多的类(默认值为0.8)。找到正确的聚类分辨率非常重要,因为细胞类型marker取决于定义的cluster。

> srat <- FindNeighbors(srat, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> srat <- FindClusters(srat, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10194
Number of edges: 337796

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9082
Number of communities: 13
Elapsed time: 1 seconds

为了进行可视化,我们还需要生成UMAP降维:

> srat <- RunUMAP(srat, dims = 1:10, verbose = F)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

聚类完成后,活动状态的identity将重置为“seurat_clusters”。让我们看看聚类情况。

> table(srat@meta.data$seurat_clusters)

   0    1    2    3    4    5    6    7    8    9   10   11   12 
2854 1671 1163 1079  851  774  426  402  381  245  178  134   36

DimPlot默认使用UMAP,以Seurat聚类作为identity:

> DimPlot(srat,label.size = 4,repel = T,label = T)

为了控制聚类分辨率和其他可能的影响因素,我们将仔细观察两个小细胞群:1)树突状细胞(DC),2)血小板。让我们看一下每种细胞类型的两个marker:DC的LILRA4和TPM2,以及血小板的PPBP和GP1BB。

> FeaturePlot(srat, features = c("LILRA4", "TPM2", "PPBP", "GP1BB"))

想象一下其他的混杂因素:

> FeaturePlot(srat, features = "Doublet_score") & theme(plot.title = element_text(size=10)) 
> FeaturePlot(srat, features = "percent.mt") & theme(plot.title = element_text(size=10))
> FeaturePlot(srat, features = "nFeature_RNA") & theme(plot.title = element_text(size=10))

让我们删除未通过QC的细胞并比较两个图。现在我们可以看到更明确的cluster。过滤后的数据集包含8,824个细胞,大约12%的细胞由于各种原因被删除。

> DimPlot(srat,label.size = 4,repel = T,label = T)
> srat <- subset(srat, subset = QC == 'Pass')
> DimPlot(srat,label.size = 4,repel = T,label = T)

最后,我们来计算细胞周期分数。这必须在标准化和中心化后完成。Seurat有一个内置列表,cc.genes(旧版)和 cc.genes.updated.2019(新版),用于定义与细胞周期有关的基因。对于CellRanger参考基因组GRCh38 v2.0.0及以上版本,请使用cc.genes.updated.2019(三个基因被重命名:MLF1IP、FAM64A和HN1变为CENPU、PICALM和JPT)。对于小鼠细胞周期基因,可以使用https://github.com/satijalab/seurat/issues/2493详述的解决方案。

> cc.genes.updated.2019
$s.genes
 [1] "MCM5"     "PCNA"     "TYMS"     "FEN1"     "MCM7"     "MCM4"    
 [7] "RRM1"     "UNG"      "GINS2"    "MCM6"     "CDCA7"    "DTL"     
[13] "PRIM1"    "UHRF1"    "CENPU"    "HELLS"    "RFC2"     "POLR1B"  
[19] "NASP"     "RAD51AP1" "GMNN"     "WDR76"    "SLBP"     "CCNE2"   
[25] "UBR7"     "POLD3"    "MSH2"     "ATAD2"    "RAD51"    "RRM2"    
[31] "CDC45"    "CDC6"     "EXO1"     "TIPIN"    "DSCC1"    "BLM"     
[37] "CASP8AP2" "USP1"     "CLSPN"    "POLA1"    "CHAF1B"   "MRPL36"  
[43] "E2F8"    

$g2m.genes
 [1] "HMGB2"   "CDK1"    "NUSAP1"  "UBE2C"   "BIRC5"   "TPX2"    "TOP2A"  
 [8] "NDC80"   "CKS2"    "NUF2"    "CKS1B"   "MKI67"   "TMPO"    "CENPF"  
[15] "TACC3"   "PIMREG"  "SMC4"    "CCNB2"   "CKAP2L"  "CKAP2"   "AURKB"  
[22] "BUB1"    "KIF11"   "ANP32E"  "TUBB4B"  "GTSE1"   "KIF20B"  "HJURP"  
[29] "CDCA3"   "JPT1"    "CDC20"   "TTK"     "CDC25C"  "KIF2C"   "RANGAP1"
[36] "NCAPD2"  "DLGAP5"  "CDCA2"   "CDCA8"   "ECT2"    "KIF23"   "HMMR"   
[43] "AURKA"   "PSRC1"   "ANLN"    "LBR"     "CKAP5"   "CENPE"   "CTCF"   
[50] "NEK2"    "G2E3"    "GAS2L3"  "CBX5"    "CENPA"  

> s.genes <- cc.genes.updated.2019$s.genes
> g2m.genes <- cc.genes.updated.2019$g2m.genes
> srat <- CellCycleScoring(srat, s.features = s.genes, g2m.features = g2m.genes)
> table(srat[[]]$Phase)

  G1  G2M    S 
5431  977 2416

让我们看看是否有因技术差异产生的cluster。线粒体基因表现出与cluster的相关性,在cluster 12中要低得多。

> FeaturePlot(srat,features = "percent.mt",label.size = 4,repel = T,label = T) & 
    theme(plot.title = element_text(size=10))
> VlnPlot(srat,features = "percent.mt") & theme(plot.title = element_text(size=10))

核糖体蛋白基因对假定的细胞类型表现出非常强的相关性,一些cluster有高达45%的相关性,而有些则只有15%。

> FeaturePlot(srat,features = "percent.rb",label.size = 4,repel = T,label = T) & 
    theme(plot.title = element_text(size=10)) 
> VlnPlot(srat,features = "percent.rb") & theme(plot.title = element_text(size=10))

每种细胞类型的RNA含量也存在差异。Cluster 6和7可能是质量较低的细胞,当我们使用QC过滤的数据重新进行聚类时,它们会消失。

> VlnPlot(srat,features = c("nCount_RNA","nFeature_RNA")) & 
    theme(plot.title = element_text(size=10))

细胞周期评分似乎并不太依赖于细胞类型,但每个cluster都有显著的异常值。

> FeaturePlot(srat,features = c("S.Score","G2M.Score"),label.size = 4,repel = T,label = T) & 
    theme(plot.title = element_text(size=10))
> VlnPlot(srat,features = c("S.Score","G2M.Score")) & 
    theme(plot.title = element_text(size=10))

往期内容:
重生之我在剑桥大学学习单细胞RNA-seq分析——7. 使用Seurat进行单细胞RNA测序分析(1)

©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容