本教程介绍了Seurat包与sctransform 包一起分析时候的一般方法。首先你也许会问:sctransform包是用来干嘛的?
这是一个值得关心和探讨的问题,我也将做一个简单的回答。最好的办法当然是到包所在的官网去读人家的自我介绍了sctransform: Variance Stabilizing Transformations for Single Cell UMI Data
A normalization method for single-cell UMI count data using a variance stabilizing transformation. The transformation is based on a negative binomial regression model with regularized parameters. As part of the same regression framework, this package also provides functions for batch correction, and data correction. See Hafemeister and Satija 2019 <doi:10.1101/576827> for more details.
有没有感觉很强大,在Seurat里面被封为SCTransform()函数可以:
- 代替 NormalizeData, ScaleData, and FindVariableFeatures.三个函数
- 转换完了在SCT assay 里
- 在均一化的同时可以移除线粒体细胞等的影响
library(Seurat)
packageVersion("Seurat")
# https://satijalab.org/seurat/mca.html
library(dplyr)
library(ggsci)
library(ggplot2)
library(sctransform)
# Load the PBMC dataset
list.files("D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\filtered_gene_bc_matrices\\hg19")
?Read10X
pbmc.data <- Read10X(data.dir = "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\filtered_gene_bc_matrices\\hg19")
# Initialize the Seurat object with the raw (non-normalized data).
?CreateSeuratObject
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
一步数据均一化,也是比较慢的。
# store mitochondrial percentage in object meta data
pbmc <- PercentageFeatureSet(pbmc, pattern = "^MT-", col.name = "percent.mt")
# run sctransform
pbmc <- SCTransform(pbmc, vars.to.regress = "percent.mt", verbose = FALSE)
pbmc
An object of class Seurat
26286 features across 2700 samples within 2 assays
Active assay: SCT (12572 features)
1 other assay present: RNA
然后是标准操作。
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(pbmc, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = FALSE)
pbmc <- FindClusters(pbmc, verbose = FALSE)
DimPlot(pbmc, label = TRUE) + NoLegend()
为什么在转化之后PC轴要比之前我们用NormalizeData 的时候要多呢?
- 更有效地消除技术影响
- 更高的PC维度可能代表一些微妙的生物学差异
- In addition, sctransform returns 3,000 variable features by default, instead of 2,000. The rationale is similar, the additional variable features are less likely to be driven by technical differences across cells, and instead may represent more subtle biological fluctuations.
我们还可以像下面这样去运行:
pbmc <- CreateSeuratObject(pbmc_data) %>% PercentageFeatureSet(pattern = "^MT-", col.name = "percent.mt") %>%
SCTransform(vars.to.regress = "percent.mt") %>% RunPCA() %>% FindNeighbors(dims = 1:30) %>%
RunUMAP(dims = 1:30) %>% FindClusters()
Where are normalized values stored for sctransform?
- pbmc[["SCT"]]@scale.data
- pbmc[["SCT"]]@data 用于可视化的数据
- pbmc[["SCT"]]@counts 校正后的count
- You can use the corrected log-normalized counts for differential expression and integration. However, in principle, it would be most optimal to perform these calculations directly on the residuals (stored in the scale.data slot) themselves. This is not currently supported in Seurat v3, but will be soon.
第四条很容易让人困惑,先不说技术上的,仅字面意思:这里给你的并不是最好的,最好的目前还没有。这句话不能和最好的方法出来的时候再说吗?以后的方法比现在的好,这很容易理解啊,但是你要告诉我最好的总在未来,我就纠结了。而且,FAQ4中建议使用RNA assay做差异分析,而在这里有建议使用SCT assay的scale.data(而且This is not currently supported in Seurat v3, but will be soon.)。
官方的解释是这样的:
Hi,
Thanks for the question, and I apologize for the confusion. We're working on allowing for DE to be performed on pearson residuals from SCTransform in an optimal way. Until then, its easiest for us to advise users just to use the RNA assay. But if you're really excited to give it a try, it is not invalid to do so. Still, in the interest of simplicity, we'll keep the FAQ as-is.
best,
Rahul
https://github.com/satijalab/seurat/issues/1421
# These are now standard steps in the Seurat workflow for visualization and clustering Visualize
# canonical marker genes as violin plots.
VlnPlot(pbmc, features = c("CD8A", "GZMK", "CCL5", "S100A4", "ANXA1", "CCR7", "ISG15", "CD3D"),
pt.size = 0.2, ncol = 4)
# Visualize canonical marker genes on the sctransform embedding.
FeaturePlot(pbmc, features = c("CD8A", "GZMK", "CCL5", "S100A4", "ANXA1", "CCR7"), pt.size = 0.2,
ncol = 3)
FeaturePlot(pbmc, features = c("CD3D", "ISG15", "TCL1A", "FCER2", "XCL1", "FCGR3A"), pt.size = 0.2,
ncol = 3)