SingleR || A tool for single-cell cell type annotation

Web interface (online version)

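Note: this tutorial covers the old version of SingleR (1.0.0). SingleR was upgraded to version 1.0.5 (revised December 18th, 2019), and the new version rewrote most of the functions; in particular, the function names changed. If you use SingleR, keep an eye on the package's release notes. If you are on the new version, follow the new tutorial instead: https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html . The installation package for the old version is available in QQ group 1057591379; join the group to get it.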

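In recent years, advances in single-cell RNA-seq (scRNA-seq) have made it possible to describe gene expression changes in disease models at unprecedented resolution. Many single-cell analysis methods have been developed to detect changes in gene expression and to cluster cells by the similarity of their expression profiles. However, assigning cell types to clusters still relies heavily on known marker genes, and the annotation is usually done by hand. This strategy is subjective and limits the ability to distinguish closely related cell subsets.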

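SingleR (Single-cell Recognition of cell types) is a computational method for unbiased cell type recognition in scRNA-seq data. SingleR uses reference transcriptomic data sets of pure cell types to infer the likely cell type of each single cell independently. Combined with Seurat (a processing and analysis package designed for scRNA-seq), SingleR's annotations provide a powerful tool for investigating scRNA-seq data.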

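The SingleR authors developed an R package that generates annotated scRNA-seq objects, which can then be visualized and analyzed further with the SingleR web tool (Single-cell Recognition).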

devtools::install_github('dviraran/SingleR')
# this might take long, though mostly because of the installation of Seurat.

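SingleR provides built-in wrapper functions that run the full pipeline in a single call. SingleR supports Seurat (http://satijalab.org/seurat/), but any other scRNA-seq package can be used as well. Examples 1 and 2 explain these functions. They read the single-cell data, compute labels using different references, and create an object that the SingleR plotting functions can use. However, to just run SingleR and retrieve a label for each cell, use the following function: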

singler = SingleR(method = "single", sc_data, ref_data, types, clusters = NULL,
  genes = "de", quantile.use = 0.8, p.threshold = 0.05,
  fine.tune = TRUE, fine.tune.thres = 0.05, sd.thres = 1,
  do.pvals = T, numCores = SingleR.numCores)

method can be 'single' or 'cluster'. 'cluster' annotates each cluster instead of each single cell. The cluster expression is the average of the expression of all the cells in the given cluster. If 'cluster' is used, the cluster ids must be given in the 'clusters' parameter.
sc_data is the single-cell expression matrix. If the data comes from a full-length method, the counts must be normalized to gene length (this can be achieved with the built-in TPM function).
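
The cluster-level averaging described for method = 'cluster' can be sketched in base R. This is only an illustration with made-up values, not SingleR's internal code; the names expr, clusters and cluster_means are assumptions introduced here, and a plain arithmetic mean is assumed.

```r
# Toy expression matrix: 4 genes x 6 cells (made-up values, for illustration only).
expr <- matrix(1:24, nrow = 4,
               dimnames = list(paste0("Gene", 1:4), paste0("Cell", 1:6)))

# One cluster id per cell, in the same order as the columns of expr.
clusters <- c("A", "A", "B", "B", "B", "C")

# Average the expression of all cells in each cluster.
# split() groups the column indices by cluster; rowMeans() averages per gene.
cluster_means <- sapply(split(seq_len(ncol(expr)), clusters),
                        function(idx) rowMeans(expr[, idx, drop = FALSE]))

# Result: a genes x clusters matrix that could be annotated cluster by cluster.
print(cluster_means)
```

Each column of cluster_means then plays the role that a single cell plays in the 'single' mode.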

Warning (must read):

warning('Do not use the scaled.data field in Seurat as input. This field represents relative expression across cells, and is not appropriate as input for SingleR. The raw.data and data field are ok, but only if from a non full-length method.')

Example 1: Counts data, no previous analysis

singler = CreateSinglerSeuratObject(counts, annot = NULL, project.name,
  min.genes = 200, technology = "10X", species = "Human", # or "Mouse"
  citation = "",
  ref.list = list(), normalize.gene.length = F, variable.genes = "de",
  fine.tune = T, reduce.file.size = T, do.signatures = T, min.cells = 2,
  npca = 10, regress.out = "nUMI", do.main.types = T,
  reduce.seurat.object = T, numCores = SingleR.numCores)
save(singler,file='singler_object.RData')

counts may be a tab-delimited text file (with the suffix '.txt'), a matrix of the counts, or a 10X directory. Importantly, the rownames must be gene symbols. To combine multiple 10X datasets, SingleR provides the function Combine.Multiple.10X.Datasets.
annot can be a tab delimited text file or a data.frame. Rownames correspond to column names in the counts data.
min.genes is a filter on samples with low number of non-zero genes.
ref.list is the reference that will be used for the annotation. If not supplied, this wrapper function uses predefined reference objects depending on the species - Mouse: ImmGen and Mouse.RNAseq, Human: HPCA and Blueprint+Encode. It is probably best to start with these references before using more specific references. See below for an explanation of how to generate a reference data set object.
normalize.gene.length - set to TRUE if the data is from a full-length method (e.g. Smart-Seq), or FALSE if it is from a 3' method (e.g. Drop-seq).
variable.genes - the method for choosing the genes used for the correlations. 'de' uses pairwise differences between the cell types, 'sd' uses a general standard deviation threshold.
fine.tune - performs the fine-tuning step. This step may take long for big datasets, but can improve results significantly if the data contains subtle differences.
do.signatures - this step runs a single-sample gene set enrichment analysis (ssGSEA) for a set of predefined signatures (see the object human.egc or mouse.egc). This step may also take long, and can be set to FALSE to shorten computation time.
min.cells, npca and regress.out are all passed directly to Seurat to create a Seurat object.
do.main.types - compute the main types scores as well.
reduce.seurat.object - removes the raw.data and calc.params from the Seurat object. The size of the object will be significantly smaller (~10-fold).
numCores - number of cores to use in parallel. The default is the number of cores in the system minus 1.
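
To make the gene-length normalization behind normalize.gene.length concrete, here is a generic TPM calculation in base R. It is a sketch of the standard TPM formula, not SingleR's built-in TPM function; the function name to_tpm, the toy counts and the gene lengths are all hypothetical.

```r
# Toy counts: 3 genes x 2 cells (made-up values).
counts <- matrix(c(10, 20, 30, 0, 5, 15), nrow = 3,
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 c("Cell1", "Cell2")))

# Hypothetical gene lengths in base pairs, named by gene symbol.
gene_length <- c(GeneA = 1000, GeneB = 2000, GeneC = 500)

# Hypothetical helper: transcripts per million.
to_tpm <- function(counts, gene_length) {
  # Reads per base for every gene (lengths are matched by row name).
  rate <- counts / gene_length[rownames(counts)]
  # Rescale each cell (column) so that its values sum to one million.
  t(t(rate) / colSums(rate)) * 1e6
}

tpm <- to_tpm(counts, gene_length)
print(colSums(tpm))  # every column sums to 1e6
```

For 3' methods such as 10X or Drop-seq this step is skipped, which matches normalize.gene.length = F in the call above.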

Example 2: Already have a single-cell object

singler = CreateSinglerObject(counts, annot = NULL, project.name, min.genes = 0,
  technology = "10X", species = "Human", citation = "",
  ref.list = list(), normalize.gene.length = F, variable.genes = "de",
  fine.tune = T, do.signatures = T, clusters = NULL, do.main.types = T, 
  reduce.file.size = T, numCores = SingleR.numCores)

singler$seurat = seurat.object # (optional)
singler$meta.data$orig.ident = seurat.object@meta.data$orig.ident # the original identities, if not supplied in 'annot'

## if using Seurat v3.0 and over use:
singler$meta.data$xy = seurat.object@reductions$tsne@cell.embeddings # the tSNE coordinates
singler$meta.data$clusters = seurat.object@active.ident # the Seurat clusters (if 'clusters' not provided)

## if using a previous Seurat version use:
singler$meta.data$xy = seurat.object@dr$tsne@cell.embeddings # the tSNE coordinates
singler$meta.data$clusters = seurat.object@ident # the Seurat clusters (if 'clusters' not provided)

# this example is of course if the previous analysis was performed with Seurat, but any other previous coordinates and clusters can be used.

save(singler,file='singler_object.RData')

Creating a new reference data set

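Suppose we have a reference data set that we want to use. It contains N samples, which can be annotated with n1 main cell types (e.g. macrophages or DCs) and n2 cell states (e.g. alveolar macrophages, interstitial macrophages, pDCs and cDCs).
The gene expression data should be normalized to gene length (TPM, FPKM, etc.) and be in log2 scale. The rownames must be gene symbols.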

  name = 'My_reference'
  expr = as.matrix(expr) # the expression matrix
  types = as.character(types) # a character list of the types. Samples from the same type should have the same name.
  main_types = as.character(main_types) # a character list of the main types. 
  ref = list(name=name,data = expr, types=types, main_types=main_types)
  
  # if using the de method, we can predefine the variable genes
  ref$de.genes = CreateVariableGeneSet(expr,types,200)
  ref$de.genes.main = CreateVariableGeneSet(expr,main_types,300)
  
  # if using the sd method, we need to define an sd threshold
  sd = matrixStats::rowSds(expr) # per-gene standard deviation (requires the matrixStats package)
  sd.thres = sort(sd, decreasing = T)[4000] # or any other threshold
  ref$sd.thres = sd.thres
  
  save(ref,file='ref.RData') # it is best to name the object and the file with the same name.
  
  # we can then use this reference in the previous functions. Multiple references can be used.
  singler = CreateSinglerObject(..., ref.list = list(immgen, ref, mouse.rnaseq))
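
Before passing a hand-built reference to SingleR, it can help to check that it has the shape described above. The helper below is a hypothetical sketch, not part of SingleR: check_reference and the toy reference are introduced here for illustration, and the checks only cover the requirements stated in this section (a named matrix with gene symbols as rownames, one label per sample).

```r
# Hypothetical helper (not a SingleR function): basic consistency checks.
check_reference <- function(ref) {
  stopifnot(
    is.character(ref$name), length(ref$name) == 1,
    is.matrix(ref$data), is.numeric(ref$data),
    !is.null(rownames(ref$data)),             # gene symbols as row names
    length(ref$types) == ncol(ref$data),      # one cell state label per sample
    length(ref$main_types) == ncol(ref$data)  # one main type label per sample
  )
  invisible(TRUE)
}

# Toy reference with made-up log2 values: 2 genes x 4 samples.
expr <- matrix(c(5.1, 0.2, 4.8, 0.3, 0.1, 6.0, 0.4, 5.7), nrow = 2,
               dimnames = list(c("GeneA", "GeneB"), paste0("Sample", 1:4)))
ref <- list(name = "Toy_reference", data = expr,
            types = c("Alveolar macrophages", "Interstitial macrophages",
                      "pDCs", "cDCs"),
            main_types = c("Macrophages", "Macrophages", "DCs", "DCs"))

check_reference(ref)  # stops with an error if any check fails
```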

How it works

Step 1: Spearman correlations

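For each single cell, SingleR computes the Spearman coefficient between the cell's expression and each sample in the reference data set. The correlation analysis is performed only on the variable genes of the reference data set. The example shows the correlation between the expression of a single cell (x-axis) and a reference sample (y-axis). Each dot in the scatter plot is a gene.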
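
A minimal base R sketch of this step, assuming the variable genes have already been chosen. The reference values, the toy cell and the names ref, cell and variable_genes are invented for illustration and are not SingleR objects.

```r
# Toy reference: 20 genes x 3 reference samples (deterministic made-up values).
genes <- paste0("Gene", 1:20)
ref <- cbind(RefA = sin(1:20), RefB = cos(1:20), RefC = (1:20) %% 7)
rownames(ref) <- genes

# A toy cell that is a rescaled copy of RefB, so it should match RefB best.
cell <- 2 * ref[, "RefB"] + 1

# Assume these variable genes were selected beforehand.
variable_genes <- genes[1:15]

# Spearman correlation of the cell with every reference sample,
# restricted to the variable genes. cor() returns a 1 x 3 matrix here.
scores <- cor(cell[variable_genes], ref[variable_genes, ],
              method = "spearman")[1, ]
print(round(scores, 3))
print(names(which.max(scores)))
```

Because Spearman correlation works on ranks, any monotone rescaling of the expression values leaves the scores unchanged.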

Variable genes: SingleR supports two modes for choosing the variable genes in the reference dataset.

‘sd’ - genes with a standard deviation across all samples in the reference dataset over a threshold. We choose thresholds such that we start with 3000-4000 genes.
‘de’ - top N genes that have a higher median expression in a cell type compared to each other cell type.
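
The two modes can be sketched as follows. This is an illustrative reading of the descriptions above, not SingleR's own implementation: the toy matrix, the threshold of 1 and top_n = 1 are assumptions chosen so the result is easy to follow.

```r
# Toy reference: 6 genes x 6 samples, two samples per cell type (made-up values).
expr <- rbind(
  Gene1 = c(8, 9, 1, 1, 1, 2),
  Gene2 = c(1, 1, 7, 8, 1, 1),
  Gene3 = c(2, 1, 1, 1, 9, 8),
  Gene4 = c(5, 5, 5, 5, 5, 5),
  Gene5 = c(3, 3, 3, 3, 3, 3),
  Gene6 = c(4, 4, 4, 4, 4, 4))
colnames(expr) <- paste0("S", 1:6)
types <- c("T", "T", "B", "B", "NK", "NK")

# 'sd' mode: keep genes whose standard deviation across all samples
# exceeds a threshold (1 is an arbitrary choice for this toy example).
gene_sd <- apply(expr, 1, sd)
sd_genes <- names(gene_sd)[gene_sd > 1]

# 'de' mode: for every ordered pair of cell types, take the top N genes
# with the largest difference in median expression, then pool them.
medians <- sapply(unique(types), function(ct)
  apply(expr[, types == ct, drop = FALSE], 1, median))
top_n <- 1
de_genes <- character(0)
for (a in colnames(medians)) for (b in setdiff(colnames(medians), a)) {
  d <- medians[, a] - medians[, b]
  de_genes <- union(de_genes, names(sort(d, decreasing = TRUE))[seq_len(top_n)])
}

print(sd_genes)
print(de_genes)
```

In this toy data both modes pick the three type-specific genes and ignore the three flat ones.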

Step 2: Aggregation of scores by cell types

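The multiple correlation coefficients of each cell type are aggregated according to the named annotations of the reference data set, giving a single value per cell type for each single cell. As noted above, the samples are aggregated either by broad cell types ("main") or by cell subsets with higher granularity. The default aggregate is the 80th percentile of the correlation values of each cell type.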
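
A sketch of this aggregation in base R, using made-up correlation values. The vectors rho and sample_types are hypothetical, and R's default quantile definition is assumed, which may differ in detail from what SingleR computes internally.

```r
# Made-up Spearman coefficients of one cell against 8 reference samples.
rho <- c(0.61, 0.65, 0.70, 0.40, 0.45, 0.30, 0.35, 0.38)

# Cell type annotation of each reference sample (same order as rho).
sample_types <- c("T cells", "T cells", "T cells",
                  "B cells", "B cells",
                  "Monocytes", "Monocytes", "Monocytes")

# One score per cell type: the 80th percentile of its correlation values.
type_scores <- tapply(rho, sample_types, quantile, probs = 0.8)

print(round(type_scores, 3))
print(names(which.max(type_scores)))  # best-scoring cell type
```

Using a high percentile instead of the mean lets a cell type score well when a subset of its reference samples matches the cell closely.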

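Here is an example of the annotation process for a single human cell. The dots are the Spearman coefficients of one cell with all reference samples (using the Blueprint+Encode reference). The coefficients are aggregated by cell type (reduced here to a set of main cell types for simplicity). The single score of each cell type is the 80th percentile of its box plot. This cell is clearly a T cell or an NK cell, but it is not clear which of the two.

The analysis above groups cell subsets and states into main cell types. SingleR also allows finer-grained cell types (only the top-scoring cell types are shown):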

Step 3: Fine-tuning

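In this step, SingleR reruns the correlation analysis, but only for the top-scoring cell types from step 2. The analysis is performed only on the genes that vary between these cell types. The cell type with the lowest value is removed (or every cell type scoring more than 0.05 below the top value), and the step is repeated until only two cell types remain. The cell type with the top value after the last run is assigned to the single cell.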
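
The loop described above can be sketched in base R. This is a simplified illustration under stated assumptions, not SingleR's implementation: the function fine_tune, its arguments, the pairwise median-difference gene selection with top_n = 2, and the toy data are all hypothetical.

```r
# Hypothetical sketch of the fine-tuning loop.
fine_tune <- function(cell, ref, types, candidates,
                      thres = 0.05, top_n = 2, q = 0.8) {
  repeat {
    if (length(candidates) == 1) return(candidates)
    keep <- types %in% candidates
    sub_ref <- ref[, keep, drop = FALSE]
    sub_types <- types[keep]
    # Variable genes between the remaining candidates only.
    med <- sapply(candidates, function(ct)
      apply(sub_ref[, sub_types == ct, drop = FALSE], 1, median))
    genes <- character(0)
    for (a in candidates) for (b in setdiff(candidates, a)) {
      d <- med[, a] - med[, b]
      genes <- union(genes, names(sort(d, decreasing = TRUE))[seq_len(top_n)])
    }
    # Spearman correlation on those genes, then one score per cell type.
    rho <- cor(cell[genes], sub_ref[genes, , drop = FALSE],
               method = "spearman")[1, ]
    agg <- tapply(rho, sub_types, quantile, probs = q)[candidates]
    # Final run: with two types left, the top one is assigned.
    if (length(candidates) <= 2) return(names(which.max(agg)))
    # Drop types more than `thres` below the top, or at least the lowest one.
    survivors <- names(agg)[agg >= max(agg) - thres]
    if (length(survivors) == length(candidates)) {
      survivors <- setdiff(candidates, names(which.min(agg)))
    }
    candidates <- survivors
  }
}

# Toy data: 8 genes, 3 cell types with 2 marker genes each (made-up values).
base <- seq(1, 2.4, by = 0.2)
profiles <- cbind("B-cells"  = base + c(6, 6, 0, 0, 0, 0, 0, 0),
                  "CD4+ Tem" = base + c(0, 0, 6, 6, 0, 0, 0, 0),
                  "Tregs"    = base + c(0, 0, 0, 0, 6, 6, 0, 0))
rownames(profiles) <- paste0("Gene", 1:8)
ref <- cbind(profiles, profiles * 1.1)   # two samples per type
types <- colnames(ref)
colnames(ref) <- paste0("Sample", 1:6)
cell <- profiles[, "Tregs"] * 1.05       # a toy cell that resembles Tregs

assigned <- fine_tune(cell, ref, types, candidates = unique(types))
print(assigned)
```

In this toy case the candidates separate cleanly, so the other types fall outside the 0.05 margin in the first iteration; closely related references would take several iterations, as in the example that follows.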

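In the example above, SingleR clearly indicates that the single cell is a memory T cell. However, it is hard to tell which of these cell subsets fits it best. The fine-tuning step helps distinguish closely related cell types. In the first fine-tuning iteration, the top cell types are selected (those within 0.05 of the CD4+ Tem score). The Spearman correlation analysis is then repeated, but using only the genes that vary between these cell types. Before fine-tuning, 3782 genes were used across all cell types. In the first fine-tuning iteration, only 1819 genes are used to distinguish the 9 cell types.

After this iteration, 5 cell types are retained.

SingleR continues these iterations, each time keeping the types with the highest correlation or removing the lowest-scoring type.

Finally, the winning annotation is a regulatory T cell (Treg). This cell was in fact a sorted Treg, but it does not express known markers such as FOXP3 and CTLA4, which makes it hard to detect with marker-based methods.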

SingleR
Single-cell Recognition
Aran, Looney, Liu et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nature Immunology (2019)

http://comphealth.ucsf.edu/SingleR/SupplementaryInformation2.html#case-study-3-simulating-number-of-non-zero-genes

Last edited: 2020.04.04 18:04:16
