Using SingleR to annotate single-cell RNA-seq data

Aaron Lun*, Jared M. Andrews1, Friederike Dündar2 and Daniel Bunis3

1Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
2Applied Bioinformatics Core, Weill Cornell Medicine
3Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA

*infinite.monkeys.with.keyboards@gmail.com

Revised: June 14th, 2020

Package

SingleR 1.4.1

1Introduction

SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference. Thus, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this biological knowledge can be propagated to new datasets in an automated manner.

To keep things brief, this vignette only provides a brief summary of the basic capabilities of SingleR. However, the package also provides more advanced functionality that includes the use of multiple references simultaneously, manipulating the cell ontology and improving performance on big datasets. Readers are referred to the book for more details.

2Using built-in references

The easiest way to use SingleR is to annotate cells against built-in references. In particular, the celldex package provides access to several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated retrieval functions. Here, we will use the Human Primary Cell Atlas (Mabbott et al. 2013), represented as a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels.

library(celldex)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont

Our test dataset consists of some human embryonic stem cells (La Manno et al. 2016) from the scRNAseq package. For the sake of speed, we will only label the first 100 cells from this dataset.

library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

We use our hpca.se reference to annotate each cell in hESCs via the SingleR() function. This identifies marker genes from the reference and uses them to compute assignment scores (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest score is the assigned to the test cell, possibly with further fine-tuning to resolve closely related labels.

library(SingleR)
pred.hesc <- SingleR(test = hESCs, ref = hpca.se, assay.type.test=1,
    labels = hpca.se$label.main)

Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before fine-tuning (first.labels), after fine-tuning (labels) and after pruning (pruned.labels), along with the associated scores.

pred.hesc
## DataFrame with 100 rows and 5 columns
##                                          scores         first.labels
##                                        <matrix>          <character>
## 1772122_301_C02  0.347652:0.109547:0.123901:... Neuroepithelial_cell
## 1772122_180_E05  0.361187:0.134934:0.148672:... Neuroepithelial_cell
## 1772122_300_H02  0.446411:0.190084:0.222594:... Neuroepithelial_cell
## 1772122_180_B09  0.373512:0.143537:0.164743:... Neuroepithelial_cell
## 1772122_180_G04  0.357341:0.126511:0.141987:... Neuroepithelial_cell
## ...                                         ...                  ...
## 1772122_299_E07 0.371989:0.169379:0.1986877:... Neuroepithelial_cell
## 1772122_180_D02 0.353314:0.115864:0.1374981:... Neuroepithelial_cell
## 1772122_300_D09 0.348789:0.136732:0.1303042:... Neuroepithelial_cell
## 1772122_298_F09 0.332361:0.141439:0.1437860:... Neuroepithelial_cell
## 1772122_302_A11 0.324928:0.101609:0.0949826:... Neuroepithelial_cell
##                       tuning.scores               labels        pruned.labels
##                         <DataFrame>          <character>          <character>
## 1772122_301_C02 0.1824402:0.0991116 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05 0.1375484:0.0647134              Neurons              Neurons
## 1772122_300_H02 0.2757982:0.1369690 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 0.0851623:0.0819878 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 0.1988415:0.1016622 Neuroepithelial_cell Neuroepithelial_cell
## ...                             ...                  ...                  ...
## 1772122_299_E07 0.1760025:0.0922504 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_D02 0.1967609:0.1124805 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_300_D09 0.0816424:0.0221368 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_298_F09 0.1872499:0.0671893 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_302_A11 0.1560800:0.1051322            Astrocyte            Astrocyte
# Summarizing the distribution:
table(pred.hesc$labels)
## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                   14                   81                    5

At this point, it is worth noting that SingleR is workflow/package agnostic. The above example uses SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.

3Using single-cell references

Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.

library(scRNAseq)
sceM <- MuraroPancreasData()

# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]

# SingleR() expects reference datasets to be normalized and log-transformed.
library(scuttle)
sceM <- logNormCounts(sceM)

We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.

sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG) 
sceG <- sceG[,1:100]

We then run SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon ranked sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero).

pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)
## 
## acinar   beta  delta   duct 
##     53      4      2     41

4Annotation diagnostics

plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved.

plotScoreHeatmap(pred.grun)
image.png

Another diagnostic is based on the per-cell “deltas”, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference. We can inspect these deltas across cells for each label using the plotDeltaDistribution() function.

plotDeltaDistribution(pred.grun, ncol = 3)
image.png

The pruneScores() function will remove potentially poor-quality or ambiguous assignments based on the deltas. The minimum threshold on the deltas is defined using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts - see ?pruneScores for more details. SingleR() will also report the pruned scores automatically in the pruned.labels field where low-quality assignments are replaced with NA.

summary(is.na(pred.grun$pruned.labels))
##    Mode   FALSE    TRUE 
## logical      96       4

Finally, a simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. We extract the identity of the markers from the metadata of the SingleR() results and use them in the plotHeatmap() function from scater, as shown below for beta cell markers. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. At the very least, it should exhibit upregulation of those markers relative to cells assigned to other labels.

all.markers <- metadata(pred.grun)$de.genes
sceG$labels <- pred.grun$labels

# Beta cell-related markers
library(scater)
plotHeatmap(sceG, order_columns_by="labels",
    features=unique(unlist(all.markers$beta))) 
image.png

5FAQs

How can I use this with my Seurat, SingleCellExperiment, or cell_data_set object?

SingleR is workflow agnostic - all it needs is normalized counts. An example showing how to map its results back to common single-cell data objects is available in the README.

Where can I find reference sets appropriate for my data?

celldex contains many built-in references that were previously in SingleR but have been migrated into a separate package for more general use by other Bioconductor packages. scRNAseq contains many single-cell datasets, many of which contain the authors’ manual annotations. ArrayExpress and GEOquery can be used to download any of the bulk or single-cell datasets in ArrayExpress or GEO, respectively.

Where can I get more help?

It is likely that your questions is already answered by the function documentation (e.g., ?SingleR). Further explanations on the reasoning behind certain functions can be found in the book. If this is not sufficient, we recommend posting an issue on the Bioconductor support site or the GitHub repository for this package. Be sure to include your session information and a minimal reproducible example.

6Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.18.3               ggplot2_3.3.3              
##  [3] scuttle_1.0.4               SingleR_1.4.1              
##  [5] scRNAseq_2.4.0              SingleCellExperiment_1.12.0
##  [7] celldex_1.0.0               SummarizedExperiment_1.20.0
##  [9] Biobase_2.50.0              GenomicRanges_1.42.0       
## [11] GenomeInfoDb_1.26.2         IRanges_2.24.1             
## [13] S4Vectors_0.28.1            BiocGenerics_0.36.0        
## [15] MatrixGenerics_1.2.1        matrixStats_0.58.0         
## [17] BiocStyle_2.18.1           
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_2.0-0             
##   [3] ellipsis_0.3.1                bluster_1.0.0                
##   [5] XVector_0.30.0                BiocNeighbors_1.8.2          
##   [7] farver_2.0.3                  bit64_4.0.5                  
##   [9] interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0         
##  [11] xml2_1.3.2                    sparseMatrixStats_1.2.1      
##  [13] cachem_1.0.1                  knitr_1.31                   
##  [15] Rsamtools_2.6.0               dbplyr_2.0.0                 
##  [17] pheatmap_1.0.12               shiny_1.6.0                  
##  [19] BiocManager_1.30.10           compiler_4.0.3               
##  [21] httr_1.4.2                    dqrng_0.2.1                  
##  [23] assertthat_0.2.1              Matrix_1.3-2                 
##  [25] fastmap_1.1.0                 lazyeval_0.2.2               
##  [27] limma_3.46.0                  later_1.1.0.1                
##  [29] BiocSingular_1.6.0            htmltools_0.5.1.1            
##  [31] prettyunits_1.1.1             tools_4.0.3                  
##  [33] rsvd_1.0.3                    igraph_1.2.6                 
##  [35] gtable_0.3.0                  glue_1.4.2                   
##  [37] GenomeInfoDbData_1.2.4        dplyr_1.0.4                  
##  [39] rappdirs_0.3.3                Rcpp_1.0.6                   
##  [41] vctrs_0.3.6                   Biostrings_2.58.0            
##  [43] ExperimentHub_1.16.0          rtracklayer_1.50.0           
##  [45] DelayedMatrixStats_1.12.2     xfun_0.20                    
##  [47] stringr_1.4.0                 beachmat_2.6.4               
##  [49] mime_0.9                      lifecycle_0.2.0              
##  [51] irlba_2.3.3                   ensembldb_2.14.0             
##  [53] statmod_1.4.35                XML_3.99-0.5                 
##  [55] AnnotationHub_2.22.0          edgeR_3.32.1                 
##  [57] scales_1.1.1                  zlibbioc_1.36.0              
##  [59] hms_1.0.0                     promises_1.1.1               
##  [61] ProtGenerics_1.22.0           RColorBrewer_1.1-2           
##  [63] AnnotationFilter_1.14.0       yaml_2.2.1                   
##  [65] curl_4.3                      gridExtra_2.3                
##  [67] memoise_2.0.0                 biomaRt_2.46.2               
##  [69] stringi_1.5.3                 RSQLite_2.2.3                
##  [71] highr_0.8                     BiocVersion_3.12.0           
##  [73] scran_1.18.3                  GenomicFeatures_1.42.1       
##  [75] BiocParallel_1.24.1           rlang_0.4.10                 
##  [77] pkgconfig_2.0.3               bitops_1.0-6                 
##  [79] evaluate_0.14                 lattice_0.20-41              
##  [81] purrr_0.3.4                   labeling_0.4.2               
##  [83] GenomicAlignments_1.26.0      bit_4.0.4                    
##  [85] tidyselect_1.1.0              magrittr_2.0.1               
##  [87] bookdown_0.21                 R6_2.5.0                     
##  [89] magick_2.6.0                  generics_0.1.0               
##  [91] DelayedArray_0.16.1           DBI_1.1.1                    
##  [93] pillar_1.4.7                  withr_2.4.1                  
##  [95] RCurl_1.98-1.2                tibble_3.0.6                 
##  [97] crayon_1.4.0                  BiocFileCache_1.14.0         
##  [99] rmarkdown_2.6                 viridis_0.5.1                
## [101] progress_1.2.2                locfit_1.5-9.4               
## [103] grid_4.0.3                    blob_1.2.1                   
## [105] digest_0.6.27                 xtable_1.8-4                 
## [107] httpuv_1.5.5                  munsell_0.5.0                
## [109] openssl_1.4.3                 beeswarm_0.2.3               
## [111] viridisLite_0.3.0             vipor_0.4.5                  
## [113] askpass_1.1

References

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2):163–72.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

La Manno, G., D. Gyllborg, S. Codeluppi, K. Nishimura, C. Salto, A. Zeisel, L. E. Borm, et al. 2016. “Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells.” Cell 167 (2):566–80.

Mabbott, Neil A., J. K. Baillie, Helen Brown, Tom C. Freeman, and David A. Hume. 2013. “An expression atlas of human primary cells: Inference of gene function from coexpression networks.” BMC Genomics 14. https://doi.org/10.1186/1471-2164-14-632.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,921评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,635评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,393评论 0 338
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,836评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,833评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,685评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,043评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,694评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 42,671评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,670评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,779评论 1 332
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,424评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,027评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,984评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,214评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,108评论 2 351
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,517评论 2 343

推荐阅读更多精彩内容