10X单细胞转录组多物种联合分析：以人和小鼠为例

10X单细胞转录组是可以联合分析多个物种的。也许，你会问为什么要分析多个物种，我一个物种的某个器官还整明白呢？其实在做：

模型
移植
免疫
病毒
生殖隔离

的时候，同一个组织的细胞并不是同一个来源的，或者同一个细胞的mRNA本身有不同的来源（如病毒侵染的细胞）。这就需要确定不同来源的比例及其演化关系。就拿最近的新冠肺炎来说吧，在单细胞水平上做的一个关键就是区分细胞内宿主和病毒基因的表达。

今天我们以10X平台下人和小鼠为例介绍一下多物种联合分析。

参考基因组

众所周知，在单一物种中我们是基于比对来给UMI定量的。那么，多个物种中的一个关键就是构建这样一个多物种的基因组，用于定量。也许我们要看一下基因组到底是什么了：

fa序列
注释信息

其实基因组主要的组成就这两部分，当然又很所指标来衡量，比如组装的水平啊，注释的是否详细啊。每个基因组都有这两个文件，质言之，我们按照某种规则把它们分别合并到一起就可以了。10X给出了规范程序，构建的基因组可以直接用于cellranger count：

cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
                   --genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf

详细的描述在这了： https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references#multi

这也说明基因组作为基础研究的重要性。

cellranger 结果

配置好基因组文件后面的分析就可以跑cellranger了，结果文件是这样：

序列的基本信息：

比对信息：

样本和GEM信息：

在seurat中分析

读入数据

library(Seurat)
library(tidyverse)

pbmc <-Read10X("filtered_feature_bc_matrix")
pbmc<- CreateSeuratObject(pbmc)
pbmc

An object of class Seurat 
112137 features across 1230 samples within 1 assay 
Active assay: RNA (112137 features)

查看数据

> tail(rownames(pbmc),3)
[1] "mm10-CAAA01064564.1" "mm10-Vmn2r122"       "mm10-CAAA01147332.1"
> head(rownames(pbmc),3)
[1] "hg19-DDX11L1"    "hg19-WASH7P"     "hg19-MIR1302-10"
> head(pbmc@meta.data)
                    orig.ident nCount_RNA nFeature_RNA
AAACCCAGTTGAAGTA SeuratProject       5531         2160
AAACCCATCGAGCTGC SeuratProject      42194         7249
AAACGAAAGTTCCGGC SeuratProject      24685         4740
AAACGAACACTTTATC SeuratProject      48421         7164
AAACGAACATCAGTGT SeuratProject      60559         8136
AAAGGATTCTTCGATT SeuratProject       6478         1269

可以看出基因名前面区分了人和小鼠，但是细胞并没有区分出来，因为前面我们说过，有的细胞中可能是有不同物种来源的基因的。

计算每个细胞人和小鼠基因的占比

> pbmc[["prop_mouse"]] <- PercentageFeatureSet(pbmc, pattern = "^mm10-")
> pbmc[["prop_human"]] <- PercentageFeatureSet(pbmc, pattern = "^hg19-")
> head(pbmc@meta.data)
                    orig.ident nCount_RNA nFeature_RNA prop_mouse prop_human
AAACCCAGTTGAAGTA SeuratProject       5531         2160 97.3603327  2.6396673
AAACCCATCGAGCTGC SeuratProject      42194         7249  0.1564203 99.8435797
AAACGAAAGTTCCGGC SeuratProject      24685         4740 99.5827426  0.4172574
AAACGAACACTTTATC SeuratProject      48421         7164  0.1900002 99.8099998
AAACGAACATCAGTGT SeuratProject      60559         8136  0.2741128 99.7258872
AAAGGATTCTTCGATT SeuratProject       6478         1269  0.5557271 99.4442729

用小提琴可视化一下：

VlnPlot(pbmc,features =c("prop_human","prop_mouse"))

在这例样本中人和小鼠的区分还是比较明显的prop_mouse和prop_human都要么在100%要么在0%，所以我们可以按照每个细胞中人和小鼠基因的表达情况来划分细胞。

cell_species<- pbmc@meta.data %>% 
  mutate(species = case_when(
    prop_mouse > 0.9 ~ "mouse",
    prop_human > 0.9 ~ "human",
    TRUE ~ "mixed"
  ))


pbmc <- AddMetaData(pbmc, metadata = cell_species$species, col.name = "species")
head(pbmc@meta.data)

head(pbmc@meta.data)
                    orig.ident nCount_RNA nFeature_RNA prop_mouse prop_human species
AAACCCAGTTGAAGTA SeuratProject       5531         2160 97.3603327  2.6396673   mouse
AAACCCATCGAGCTGC SeuratProject      42194         7249  0.1564203 99.8435797   human
AAACGAAAGTTCCGGC SeuratProject      24685         4740 99.5827426  0.4172574   mouse
AAACGAACACTTTATC SeuratProject      48421         7164  0.1900002 99.8099998   human
AAACGAACATCAGTGT SeuratProject      60559         8136  0.2741128 99.7258872   human
AAAGGATTCTTCGATT SeuratProject       6478         1269  0.5557271 99.4442729   human

seurat标准分析

pbmc <- pbmc %>%
  NormalizeData() %>% 
  FindVariableFeatures() %>% 
  ScaleData() %>% 
  RunPCA(dims = 1:30) %>%
  RunTSNE(dims = 1:20, check_duplicates = FALSE)%>%
  RunUMAP(dims = 1:20, check_duplicates = FALSE)

DimPlot(pbmc, reduction = "umap", pt.size = 0.5, group.by = "species")

差异分析

hmark <-FindMarkers(pbmc,ident.1 ="mouse", group.by ="species" )
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04m 56s
head(hmark )

                      p_val avg_logFC pct.1 pct.2     p_val_adj
hg19-MT-ND4L  6.307033e-223 -1.116752 0.026 0.970 7.072517e-218
hg19-RPS21    6.659728e-223 -2.062144 0.080 0.980 7.468020e-218
hg19-PSMA7    9.646581e-221 -1.897288 0.057 0.972 1.081739e-215
hg19-MT-ND6   2.502909e-220 -1.691199 0.031 0.976 2.806687e-215
hg19-MTATP6P1 7.943424e-219 -2.492510 0.058 0.982 8.907518e-214
hg19-GNG5     3.535631e-218 -1.487181 0.038 0.951 3.964751e-213

FeaturePlot(pbmc,features = c("hg19-MT-ND4L","mm10-Snf8"))

分群

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc<-FindClusters(pbmc)
DimPlot(pbmc, reduction = "umap", pt.size = 0.5)

整合

> species.list <- SplitObject(pbmc, split.by = "species")
> species.list
$mouse
An object of class Seurat 
112137 features across 736 samples within 1 assay 
Active assay: RNA (112137 features)
 3 dimensional reductions calculated: pca, tsne, umap

$human
An object of class Seurat 
112137 features across 494 samples within 1 assay 
Active assay: RNA (112137 features)
 3 dimensional reductions calculated: pca, tsne, umap

for (i in names(species.list)) {
  species.list[[i]] <- SCTransform(species.list[[i]], verbose = FALSE)
}
species.features <- SelectIntegrationFeatures(object.list = species.list, nfeatures = 3000)
species.list <- PrepSCTIntegration(object.list = species.list, anchor.features = species.features)
species.anchors <- FindIntegrationAnchors(object.list = species.list, normalization.method = "SCT", 
                                       anchor.features = species.features)
species.integrated <- IntegrateData(anchorset = species.anchors, normalization.method = "SCT")

species.integrated <- RunPCA(object = species.integrated, verbose = FALSE)
species.integrated <- RunUMAP(object = species.integrated, dims = 1:30)
head(species.integrated@meta.data)
DimPlot(species.integrated, group.by = c("species"), combine = FALSE)

下游分析

其实得到这个表达谱之后，后面的受配体分析、亚群分析等都可以。

mixing-mouse-and-human-10x-single-cell-rnaseq-data