10X单细胞转录组是可以联合分析多个物种的。也许,你会问为什么要分析多个物种,我一个物种的某个器官还整明白呢?其实在做:
- 模型
- 移植
- 免疫
- 病毒
- 生殖隔离
的时候,同一个组织的细胞并不是同一个来源的,或者同一个细胞的mRNA本身有不同的来源(如病毒侵染的细胞)。这就需要确定不同来源的比例及其演化关系。就拿最近的新冠肺炎来说吧,在单细胞水平上做的一个关键就是区分细胞内宿主和病毒基因的表达。
今天我们以10X平台下人和小鼠为例介绍一下多物种联合分析。
参考基因组
众所周知,在单一物种中我们是基于比对来给UMI定量的。那么,多个物种中的一个关键就是构建这样一个多物种的基因组,用于定量。也许我们要看一下基因组到底是什么了:
- fa序列
- 注释信息
其实基因组主要的组成就这两部分,当然又很所指标来衡量,比如组装的水平啊,注释的是否详细啊。每个基因组都有这两个文件,质言之,我们按照某种规则把它们分别合并到一起就可以了。10X给出了规范程序,构建的基因组可以直接用于cellranger count
:
cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
--genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
这也说明基因组作为基础研究的重要性。
cellranger 结果
配置好基因组文件后面的分析就可以跑cellranger了,结果文件是这样:
序列的基本信息:
比对信息:
样本和GEM信息:
在seurat中分析
读入数据
library(Seurat)
library(tidyverse)
pbmc <-Read10X("filtered_feature_bc_matrix")
pbmc<- CreateSeuratObject(pbmc)
pbmc
An object of class Seurat
112137 features across 1230 samples within 1 assay
Active assay: RNA (112137 features)
查看数据
> tail(rownames(pbmc),3)
[1] "mm10-CAAA01064564.1" "mm10-Vmn2r122" "mm10-CAAA01147332.1"
> head(rownames(pbmc),3)
[1] "hg19-DDX11L1" "hg19-WASH7P" "hg19-MIR1302-10"
> head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA
AAACCCAGTTGAAGTA SeuratProject 5531 2160
AAACCCATCGAGCTGC SeuratProject 42194 7249
AAACGAAAGTTCCGGC SeuratProject 24685 4740
AAACGAACACTTTATC SeuratProject 48421 7164
AAACGAACATCAGTGT SeuratProject 60559 8136
AAAGGATTCTTCGATT SeuratProject 6478 1269
可以看出基因名前面区分了人和小鼠,但是细胞并没有区分出来,因为前面我们说过,有的细胞中可能是有不同物种来源的基因的。
计算每个细胞人和小鼠基因的占比
> pbmc[["prop_mouse"]] <- PercentageFeatureSet(pbmc, pattern = "^mm10-")
> pbmc[["prop_human"]] <- PercentageFeatureSet(pbmc, pattern = "^hg19-")
> head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA prop_mouse prop_human
AAACCCAGTTGAAGTA SeuratProject 5531 2160 97.3603327 2.6396673
AAACCCATCGAGCTGC SeuratProject 42194 7249 0.1564203 99.8435797
AAACGAAAGTTCCGGC SeuratProject 24685 4740 99.5827426 0.4172574
AAACGAACACTTTATC SeuratProject 48421 7164 0.1900002 99.8099998
AAACGAACATCAGTGT SeuratProject 60559 8136 0.2741128 99.7258872
AAAGGATTCTTCGATT SeuratProject 6478 1269 0.5557271 99.4442729
用小提琴可视化一下:
VlnPlot(pbmc,features =c("prop_human","prop_mouse"))
在这例样本中人和小鼠的区分还是比较明显的prop_mouse和prop_human都要么在100%要么在0%,所以我们可以按照每个细胞中人和小鼠基因的表达情况来划分细胞。
cell_species<- pbmc@meta.data %>%
mutate(species = case_when(
prop_mouse > 0.9 ~ "mouse",
prop_human > 0.9 ~ "human",
TRUE ~ "mixed"
))
pbmc <- AddMetaData(pbmc, metadata = cell_species$species, col.name = "species")
head(pbmc@meta.data)
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA prop_mouse prop_human species
AAACCCAGTTGAAGTA SeuratProject 5531 2160 97.3603327 2.6396673 mouse
AAACCCATCGAGCTGC SeuratProject 42194 7249 0.1564203 99.8435797 human
AAACGAAAGTTCCGGC SeuratProject 24685 4740 99.5827426 0.4172574 mouse
AAACGAACACTTTATC SeuratProject 48421 7164 0.1900002 99.8099998 human
AAACGAACATCAGTGT SeuratProject 60559 8136 0.2741128 99.7258872 human
AAAGGATTCTTCGATT SeuratProject 6478 1269 0.5557271 99.4442729 human
seurat标准分析
pbmc <- pbmc %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(dims = 1:30) %>%
RunTSNE(dims = 1:20, check_duplicates = FALSE)%>%
RunUMAP(dims = 1:20, check_duplicates = FALSE)
DimPlot(pbmc, reduction = "umap", pt.size = 0.5, group.by = "species")
差异分析
hmark <-FindMarkers(pbmc,ident.1 ="mouse", group.by ="species" )
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04m 56s
head(hmark )
p_val avg_logFC pct.1 pct.2 p_val_adj
hg19-MT-ND4L 6.307033e-223 -1.116752 0.026 0.970 7.072517e-218
hg19-RPS21 6.659728e-223 -2.062144 0.080 0.980 7.468020e-218
hg19-PSMA7 9.646581e-221 -1.897288 0.057 0.972 1.081739e-215
hg19-MT-ND6 2.502909e-220 -1.691199 0.031 0.976 2.806687e-215
hg19-MTATP6P1 7.943424e-219 -2.492510 0.058 0.982 8.907518e-214
hg19-GNG5 3.535631e-218 -1.487181 0.038 0.951 3.964751e-213
FeaturePlot(pbmc,features = c("hg19-MT-ND4L","mm10-Snf8"))
分群
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc<-FindClusters(pbmc)
DimPlot(pbmc, reduction = "umap", pt.size = 0.5)
整合
> species.list <- SplitObject(pbmc, split.by = "species")
> species.list
$mouse
An object of class Seurat
112137 features across 736 samples within 1 assay
Active assay: RNA (112137 features)
3 dimensional reductions calculated: pca, tsne, umap
$human
An object of class Seurat
112137 features across 494 samples within 1 assay
Active assay: RNA (112137 features)
3 dimensional reductions calculated: pca, tsne, umap
for (i in names(species.list)) {
species.list[[i]] <- SCTransform(species.list[[i]], verbose = FALSE)
}
species.features <- SelectIntegrationFeatures(object.list = species.list, nfeatures = 3000)
species.list <- PrepSCTIntegration(object.list = species.list, anchor.features = species.features)
species.anchors <- FindIntegrationAnchors(object.list = species.list, normalization.method = "SCT",
anchor.features = species.features)
species.integrated <- IntegrateData(anchorset = species.anchors, normalization.method = "SCT")
species.integrated <- RunPCA(object = species.integrated, verbose = FALSE)
species.integrated <- RunUMAP(object = species.integrated, dims = 1:30)
head(species.integrated@meta.data)
DimPlot(species.integrated, group.by = c("species"), combine = FALSE)
下游分析
其实得到这个表达谱之后,后面的受配体分析、亚群分析等都可以。