质控来源:
质控步骤
1. 下载GSE152048基因表达矩阵(BC3)
导入后结果:
路径:D:\keti _zaiyan_english\course3\BC3
2. 安装seurat包
3. 读取数据
Seurat function:Read10x——Load in data from 10X
data.dir
Directory containing the matrix,mtx, genes.tsv (or features.tsv), and barcodes.tsv files provided by 10X. A vector or named vector can be given in order to load several data directories. If a named vector is given, the cell barcode names will be prefixed with the name.
gene.column
Specify which column of genes.tsv or features.tsv to use for gene names; default is 2
cell.column
Specify which column of barcodes.tsv to use for cell names; default is 1
unique.features
Make feature names unique (default TRUE)
strip.suffix
Remove trailing "-1" if present in all cell barcodes.
4. 查看seurat.count的数据结构
class(scRNA.counts)
[1] "dgCMatrix"#稀疏矩阵
5. 创建seurat对象(S4对象)
#CreateSeuratObject是创建S4对象的function
scRNA = CreateSeuratObject(scRNA.counts,min.cells = 3, project="os", min.features = 300)
View(scRNA)
#第一种S4对象的提取方法:点击白框
#第二种S4对象的提取方法:@和$交替使用
count = scRNA@assays$RNA@counts
scRNA@meta.data
?CreateSeuratObject
counts
Either a matrix-like object with unnormalized data with cells as columns and features as rowms or an Assay-derived object
#非标准化,基因为行名,细胞为列名的表达矩阵
min.cells
Include features detected in at least this many cells.
#基因至少要在多少个细胞内被检测到,基因至少要在几个细胞中表达
min.features
Include cells where at least this many features are detected.
#有多少个基因能被检测到的细胞,min.features = 300指每个细胞至少检测到300个gene
- S3对象和S4对象
s3对象包括 data.frame list character matirx
s4为层级结构
s4对象的索引
6. 查看sample的细胞数量
table(scRNA@meta.data$orig.ident)
os
8684
7. 计算质控指标
?PercentageFeatureSet
——Calculate the percentage of all counts that belong to a given set of features
#计算细胞中线粒体基因比例
scRNA[[]] #默认索引提取metadata!!!seurat的结构设计
scRNA[["percent.mt"]] <- PercentageFeatureSet(scRNA, pattern = "^MT-")#MT-是线粒体基因的开头;正则表达式;在metadata的dataframe里加一列线粒体的百分比
HB.genes <- c("HBA1","HBA2","HBB","HBD","HBE1","HBG1","HBG2","HBM","HBQ1","HBZ")
#红细胞基因包括...成熟红细胞没有细胞核,应去除此类基因
HB_m <- match(HB.genes, rownames(scRNA@assays$RNA))
#匹配rowname,即基因名
HB.genes <- rownames(scRNA@assays$RNA)[HB_m]
HB.genes <- HB.genes[!is.na(HB.genes)] #!(删除)判断HB.gene为True的基因
scRNA[["percent.HB"]]<-PercentageFeatureSet(scRNA, features=HB.genes)
#对这几个HBgene计算红细胞比例
#head(scRNA@meta.data)
col.num <- length(levels(scRNA@active.ident))
【R语言正则表达式】
正则表达式符号 | 含义 |
---|---|
^ | 匹配一个字符串的开始 |
$ | 匹配一个字符串的结束 |
. | 匹配除了换行符以外的任意字符 |
* | 匹配所有含有*后的字符 |
? | 匹配所有含有?后的字符 |
+ | 匹配所有含有+后的字符 |
.* | 匹配任意字符 |
8. 画图
#画图
violin <- VlnPlot(scRNA,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt","percent.HB"),
cols =rainbow(col.num),
pt.size = 0.01,
ncol = 4) +
theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
violin
-
结果
Feature、count、线粒体基因、红细胞基因占比可视化。
9. 质控
nature原文标准:当(1) total UMI counts of no more than 1,000; (2) gene numbers no higher than 500; (3) mitochondrial gene percentage of greater than 20.—删除该细胞
线粒体含量高:细胞活性状态异常,凋亡状态;但比如肝脏细胞代谢活跃,阈值可以稍微提高
【UMI用于标记转录本:UMI是一段12nt的核苷酸序列,每一个 mRNA,随机连上一个 UMI,因此可以计数不同的 UMI,最终计数 mRNA 的数量。
UMI的作用是绝对定量,因为每个mRNA的扩增效率是不一样的,即使两个初始的mRNA表达量一致,在多轮扩增之后,我们也会误认为他们差异表达。UMI通过初始的标记,让我们可以统计扩增后UMI的种类,就可以知道原始的表达量了】
【10xBarcode是一段16nt的核苷酸序列,用来标记细胞:在Barcode与细胞融合形成水凝珠之后,可以保证一个细胞的所有基因序列都带着相同的Barcode序列,也就可以认定这些序列来自同一个细胞。】
10. 质控后和质控前图对比
ggsave("vlnplot_before_qc.pdf", plot = violin, width = 12, height = 6)
ggsave("vlnplot_before_qc.png", plot = violin, width = 12, height = 6)
plot1=FeatureScatter(scRNA, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2=FeatureScatter(scRNA, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot3=FeatureScatter(scRNA, feature1 = "nCount_RNA", feature2 = "percent.HB")
pearplot <- CombinePlots(plots = list(plot1, plot2, plot3), nrow=1, legend="none")
plot1
plot2
plot3
pearplot
-
结果
pearplot;随着测序数量的增加,单细胞所检测到的基因数量也在增加,其中,这两个是有一定关联性的,在作图之前需要对含有线粒体基因的细胞进行筛选,因为过高会干扰细胞分群;图1是对于线粒体比例于测序数据的关系的散点图即测序深度和线粒体基因含量的关系,图2是对基因于测序数据序列的关系的散点图,即测序深度于基因数量的关系
scRNA1 <- subset(scRNA, subset = nFeature_RNA > 300& nFeature_RNA < 7000 & percent.mt < 10 & percent.HB < 3 & nCount_RNA < 100000)
#!非;&和;|或;需同时满足条件
#注意:默认线粒体含量至少要小于20%,这是根据生物学知识得出的默认阈值。红细胞的数目要至少小于5%;至于nFeature_RNA和nCount_RNA的阈值怎么确定,这个要结合pearplot的图来判断、我们质控的目标就是删除离异值。而且注意阈值尽可能取的宽松一下,防止后面分析想要的细胞得不到→接下来从pearplot的图片来做质控---剔除离异值
11. 标准化
- 基本概念:
- 文库大小(library size):指的是每个细胞中所有基因的表达量之和。细胞的文库如果很小,说明在文库制备过程中存在RNA的损失要么是由于细胞裂解,要么是cDNA捕获不足导致后续的扩增量少。
- 每个细胞中表达基因的数目(number of expressed features in each cell):指的是细胞中非0表达量的基因数量。如果细胞中基本没有基因表达,很可能是转录本捕获失败。
- 比对到线粒体基因组的reads比例:比对到线粒体的reads增多,说明细胞质中的RNA减少,可能存在细胞穿孔的情况,而这个孔的大小,可能只是将细胞质中存在的mRNA流出去,但线粒体的体积超过了孔的大小,所以还留在了细胞中,造成一定程度的富集,导致线粒体RNA占比升高。
- 过滤的目的:就是去除低质量的细胞,比如线粒体含量超过20%,还有基因种类低于两百的,依据具体情况来定。
nature原文:The expression level of each gene in each cell was normal-
ized using the function NormalizeData with the default parameters
to remove the influence of sequencing library size, which converted
expression values from UMI counts to ln[10,000 × UMI counts/total
UMI counts in cell + 1
Seuat的默认标准化方法: 每个细胞的某一count加1然后先除以该细胞的总count,然后乘以scale因子 -10000,再做个对数转换。
?NormalizeData
object——An object
Method for normalization:
- LogNormalize: Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. This is then natural-log transformed using log1p.
- CLR: Applies a centered log ratio transformation
- RC: Relative counts. Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. No log-transformation is applied. For counts per million (CPM) set scale.factor = 1e6
- normalize的方法
scRNA1 <- NormalizeData(scRNA1, normalization.method = "LogNormalize", scale.factor = 10000)
save(scRNA1,file='scRNA1.Rdata')