SingleR

教程网址（https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html)
安装网址(https://bioconductor.org/packages/3.11/bioc/html/SingleR.html)

Package

SingleR 1.2.4

1. Introduction

注释单细胞测序数据。注释原理：首先给定参考集，即已知身份的测序数据（single-cell or bulk），然后，SingleR基于相似性注释测试集数据。
流程：

计算测试集中每个样品与参考集中每个样品的相似度（spearman correlation）。这里只考虑marker基因，是来自于参考集中不同label样品两两比较找到的marker基因，合并在一起的合集。
以参考集细胞类型为单位，计算per-label score（quantile 0.8 of correlation distribution）
对所有label重复这个过程。选取最好分数的标签当做细胞身份。
选择性微调：
参考集根据标签微调；使用marker gene重算score；循环

2. Reference - using built-in reference

SingleR自带一些reference，多为bulk数据或者microarray数据。
参考集：eg. HumanPrimaryCellAtlasData()

library(SingleR)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se

## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont

测试集： dataset from La Manno et al. (2016).
For the sake of speed, 只选前100

library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

# SingleR() expects log-counts, but the function will also happily take raw
# counts for the test dataset. The reference, however, must have log-values.
library(scater)
hESCs <- logNormCounts(hESCs)
hESCs

class: SingleCellExperiment 
dim: 18538 100 
metadata(0):
assays(2): counts logcounts
rownames(18538): WASH7P_p1 LINC01002_loc4 ... IL9R_loc1 DDX11L16_loc1
rowData names(0):
colnames(100): 1772122_301_C02 1772122_180_E05 ... 1772122_298_F09 1772122_302_A11
colData names(3): Cell_ID Cell_type Timepoint
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

接下来,利用参考集hpca.se注释测试集hESCs。

pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
pred.hesc

## DataFrame with 100 rows and 5 columns
##                                          scores         first.labels
##                                        <matrix>          <character>
## 1772122_301_C02  0.347652:0.109547:0.123901:... Neuroepithelial_cell
## 1772122_180_E05  0.361187:0.134934:0.148672:... Neuroepithelial_cell
## 1772122_300_H02  0.446411:0.190084:0.222594:... Neuroepithelial_cell
## 1772122_180_B09  0.373512:0.143537:0.164743:... Neuroepithelial_cell
## 1772122_180_G04  0.357341:0.126511:0.141987:... Neuroepithelial_cell
##                       tuning.scores               labels        pruned.labels
##                         <DataFrame>          <character>          <character>
## 1772122_301_C02 0.1824402:0.0991116 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05 0.1375484:0.0647134              Neurons              Neurons
## 1772122_300_H02 0.2757982:0.1369690 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 0.0851623:0.0819878 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 0.1988415:0.1016622 Neuroepithelial_cell Neuroepithelial_cell

※※※ 3. Reference - using single-cell data

以 scRNAseq package里的两个human pancreas 数据集为例。
参考集： data from Muraro et al. (2016)

library(scRNAseq)
sceM <- MuraroPancreasData()

# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]

sceM <- logNormCounts(sceM)

测试集：data from Grun et al. (2016)

sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG) 
sceG <- sceG[,1:100]

<font color=Red>※※※</font> SingleR() 加上marker detection模式来排除单细胞中基因表达的差异，用wilcoxon ranked sum test来确定标签之间的top marker基因。这个过程比较慢，但是相比于默认的marker检测程序（low coverage and median of expression is frequently 0）会更适合单细胞数据。

pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)

## acinar   beta  delta   duct 
##     53      4      2     41

4. annotation diagnostics

4.1 based on the scores within cells

plotScoreHeatmap(pred.grun)
plotScoreHeatmap(pred.grun, 
    annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))

4.2 based on deltas across cells

to.remove <- pruneScores(pred.grun)
summary(to.remove)
##    Mode   FALSE    TRUE 
## logical      96       4

plotScoreDistribution(pred.grun, show = "delta.med", ncol = 3, show.nmads = 3)

new.pruned <- pred.grun$labels
new.pruned[pruneScores(pred.grun, nmads=5)] <- NA
table(new.pruned, useNA="always")
## new.pruned
## acinar   beta  delta   duct   <NA> 
##     53      4      2     41      0

4.3 based on marker gene expression

all.markers <- metadata(pred.grun)$de.genes
sceG$labels <- pred.grun$labels

# Beta cell-related markers
plotHeatmap(sceG, order_columns_by="labels",
    features=unique(unlist(all.markers$beta))) 
    
for (lab in unique(pred.grun$labels)) {
    plotHeatmap(sceG, order_columns_by=list(I(pred.grun$labels)), 
        features=unique(unlist(all.markers[[lab]]))) 
}

5. Available reference

https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html#1_introduction

6. Reference options

※※※ 6.1 Pseudo-bulk aggregation

单细胞测序数据集提供了like-for-like模式的比较，帮助细胞更好的分类（hopefully）。但是这种参考集也存在一些问题，单细胞参考集中存在比bulk参考集多得多的样品，增加了计算工作。我们可以通过整合每个label的单细胞数据，构建“pseudo bulk”数据（e.g., 平均log-expression value）。
这个方法最大的缺点是丢失了每种类型细胞的分布信息。这可能导致一个远离中心的细胞的异质性丢失。所以，我们使用k-means聚类方法尽量保留这些信息。
这种整合的方法应用于aggregateReferences()。

set.seed(100) # for the k-means step.
aggr <- aggregateReference(sceM, labels=sceM$label)
aggr
## class: SummarizedExperiment 
## dim: 19059 116 
## metadata(0):
## assays(1): logcounts
## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(0):
## colnames(116): alpha.1 alpha.2 ... mesenchymal.8 epsilon.1
## colData names(1): label

pred.aggr <- SingleR(sceG, aggr, labels=aggr$label)
table(pred.aggr$labels)
## acinar   beta  delta   duct 
##     53      4      1     42

6.2 多重参考集

6.3 融合标签

待更新

“贤者以其昭昭使人昭昭，今以其昏昏使人昭昭。” --《孟子·尽心下》

【单细胞】SingleR注释细胞类型