Inferring tumour purity and stromal and immune cell admixture from expression data.,NC,3013
单细胞转录组是揭示细胞异质性的的有力武器,鉴于肿瘤的异质性,这一点在肿瘤样本中表现尤为突出。所以肿瘤样本的单细胞转录组就不只是无监督地分个群那么简单,基于我们对肿瘤样本已经积累起来的生物学背景(如TCGA),我们可以从更多侧面来反映和说明肿瘤样本的异质性。
今天给大家介绍一款根据stromal和immune细胞比例估算肿瘤纯度的工具:ESTIMATE。之前是基于bulk表达谱来做的,在简书已经有详细的介绍了:
文章解读:
利用表达数据计算基质打分与免疫打分进而预测肿瘤纯度 --- ESTIMATE
代码实践:
使用ESTIMATE来根据stromal和immune细胞比例估算肿瘤纯度
ESTIMATE (E
stimation of ST
romal and I
mmune cells in MA
lignant T
umor tissues using E
xpression data) is a tool for predicting tumor purity
, and the presence of infiltrating stromal/immune cells in tumor tissues using gene expression data
. ESTIMATE algorithm is based on single sample Gene Set Enrichment Analysis and generates three scores:
- stromal score (that captures the presence of stroma in tumor tissue),
- immune score (that represents the infiltration of immune cells in tumor tissue), and
- estimate score (that infers tumor purity).
可以看到和目前单细胞转录组中有些基于富集的细胞类型定义还是很像的,根据一个基因list通过某种规则(这里是ssGSEA)来对细胞打分,进而推断出细胞的类型。
所以正如生信技能树在简书上所言:
其实对大部分使用该包的的文章来说,需要的反而是该包定义的2个基因集,stromal 和 immune , 列表是:
StromalSignature estimate DCN PAPPA SFRP4 THBS2 LY86 CXCL14 FOXF1 COL10A1 ACTG2 APBB1IP SH2D1A SULF1 MSR1 C3AR1 FAP PTGIS ITGBL1 BGN CXCL12 ECM2 FCGR2A MS4A4A WISP1 COL1A2 MS4A6A EDNRA VCAM1 GPR124 SCUBE2 AIF1 HEPH LUM PTGER3 RUNX1T1 CDH5 PIK3R5 RAMP3 LDB2 COX7A1 EDIL3 DDR2 FCGR2B LPPR4 COL15A1 AOC3 ITIH3 FMO1 PRKG1 PLXDC1 VSIG4 COL6A3 SGCD COL3A1 F13A1 OLFML1IGSF6 COMP HGF GIMAP5 ABCA6 ITGAM MAF ITM2A CLEC7A ASPN LRRC15 ERG CD86 TRAT1 COL8A2 TCF21 CD93 CD163 GREM1 LMOD1TLR2 ZEB2 C1QB KCNJ8 KDR CD33 RASGRP3 TNFSF4 CCR1 CSF1R BTK MFAP5 MXRA5 ISLR ARHGAP28 ZFPM2 TLR7 ADAM12 OLFML2B ENPP2 CILP SIGLEC1 SPON2 PLXNC1 ADAMTS5 SAMSN1 CH25H COL14A1 EMCN RGS4 PCDH12 RARRES2 CD248 PDGFRB C1QA COL5A3 IGF1 SP140TFEC TNN ATP8B4 ZNF423 FRZB SERPING1 ENPEP CD14 DIO2 FPR1 IL18R1 HDC TXNDC3 PDE2A RSAD2 ITIH5 FASLG MMP3 NOX4 WNT2 LRRC32 CXCL9 ODZ4 FBLN2 EGFL6 IL1B SPON1 CD200
ImmuneSignature estimate LCP2 LSP1 FYB PLEK HCK IL10RA LILRB1 NCKAP1L LAIR1 NCF2 CYBB PTPRC IL7R LAPTM5 CD53 EVI2BSLA ITGB2 GIMAP4 MYO1F HCLS1 MNDA IL2RG CD48 AOAH CCL5 LTB GMFG GIMAP6 GZMK LST1 GPR65 LILRB2 WIPF1 CD37 BIN2 FCER1G IKZF1 TYROBP FGL2 FLI1 IRF8 ARHGAP15 SH2B3 TNFRSF1B DOCK2 CD2 ARHGEF6 CORO1A LY96 LYZ ITGAL TNFAIP3 RNASE6TGFB1 PSTPIP1 CST7 RGS1 FGR SELL MICAL1 TRAF3IP3 ITGA4 MAFB ARHGDIB IL4R RHOH HLA-DPA1 NKG7 NCF4 LPXN ITK SELPLG HLA-DPB1 CD3D CD300A IL2RB ADCY7 PTGER4 SRGN CD247 CCR7 MSN ALOX5AP PTGER2 RAC2 GBP2 VAV1 CLEC2B P2RY14 NFKBIAS100A9 IFI30 MFSD1 RASSF2 TPP1 RHOG CLEC4A GZMB PVRIG S100A8 CASP1 BCL2A1 HLA-E KLRB1 GNLY RAB27A IL18RAP TPST2 EMP3 GMIP LCK IL32 PTPRCAP LGALS9 CCDC69 SAMHD1 TAP1 GBP1 CTSS GZMH ADAM8 GLRX PRF1 CD69 HLA-B HLA-DMA CD74 KLRK1 PTPRE HLA-DRA VNN2 TCIRG1 RABGAP1L CSTA ZAP70 HLA-F HLA-G CD52 CD302 CD27
其实这R包的函数本不多:
# 安装一下
library(utils)
rforge <- "http://r-forge.r-project.org"
install.packages("estimate", repos=rforge, dependencies=TRUE)
在这里我们就不再?estimateScore
来一步一步执行示例数据了,虽然对于新手来说这一步往往是不能省略的。实在想看就看技能树的吧:使用ESTIMATE来根据stromal和immune细胞比例估算肿瘤纯度
我就想,这么好的工具单细胞能不能使用呢?我也是表达谱啊,应该是可以的吧,如果可以,Seurat的数据格式可不可以直接做呢?
带着一系列疑问我们来试试。
无疑,作为表达谱我是合格的。关键就是数据格式的问题了。我们发现这个R包操作的都是路径,就连表达谱要求的都是:?estimate::filterCommonGenes
。如果想传入一个Seurat的对象我们是要改造一个函数了。
myfilterCommonGenes <- edit(estimate::filterCommonGenes)
看到这个函数的文件读入是用read.table
的:
function (input.f, output.f, id = c("GeneSymbol", "EntrezID"))
{
stopifnot((is.character(input.f) && length(input.f) == 1 &&
nzchar(input.f)) || (inherits(input.f, "connection") &&
isOpen(input.f, "r")))
stopifnot((is.character(output.f) && length(output.f) ==
1 && nzchar(output.f)))
id <- match.arg(id)
input.df <- read.table(input.f, header = TRUE, row.names = 1,
sep = "\t", quote = "", stringsAsFactors = FALSE)
merged.df <- merge(common_genes, input.df, by.x = id, by.y = "row.names")
rownames(merged.df) <- merged.df$GeneSymbol
merged.df <- merged.df[, -1:-ncol(common_genes)]
print(sprintf("Merged dataset includes %d genes (%d mismatched).",
nrow(merged.df), nrow(common_genes) - nrow(merged.df)))
outputGCT(merged.df, output.f)
}
我想直接把seurat的某个gene-cell矩阵对象给它,于是就把输入改成:
myfilterCommonGenes <- function(input.f, output.f, id = c("GeneSymbol", "EntrezID"))
{
id <- match.arg(id)
input.df <- input.f
merged.df <- merge(common_genes, input.df, by.x = id, by.y = "row.names")
rownames(merged.df) <- merged.df$GeneSymbol
merged.df <- merged.df[, -1:-ncol(common_genes)]
print(sprintf("Merged dataset includes %d genes (%d mismatched).",
nrow(merged.df), nrow(common_genes) - nrow(merged.df)))
outputGCT(merged.df, output.f)
}
为了让改造后的函数依然在estimated的环境之中:
environment(myfilterCommonGenes) <- environment(estimate::filterCommonGenes)
我们来试试,读入我们熟悉的pbmc数据,注意这里的数据仅作为Seurat的演示示例:
pbmc <- readRDS('G:\\Desktop\\Desktop\\RStudio\\single_cell\\filtered_gene_bc_matrices\\hg19pbmc_tutorial.rds')
DefaultAssay(pbmc) <- "RNA"
myfilterCommonGenes(input.f= as.matrix(pbmc@assays$RNA@data), output.f=paste0("matrixgenes.gct"), id="GeneSymbol")
一行提示:
[1] "Merged dataset includes 7295 genes (3117 mismatched)."
同时,在当前路径下生成了matrixgenes.gct
,然后我们启动打分程序:
?estimateScore # 还是建议大家多问
# c("affymetrix", "agilent", "illumina")
estimateScore(paste0("matrixgenes.gct"),paste0("estimate_score.gct"), platform= "affymetrix")
[1] "1 gene set: StromalSignature overlap= 58"
[1] "2 gene set: ImmuneSignature overlap= 141" # 可以看到overlap的基因比较少,我们毕竟是10X的数据啊。
我们比较感兴趣的就是这个打分文件estimate_score.gct
,我们来看那看是怎样的,以及如何把它写入Seurat对象中。
estimate_score <- read.table(paste0("estimate_score.gct"),header = F,skip=2,stringsAsFactors = FALSE)
estimate_score <- t(estimate_score)
rownames(estimate_score) <- estimate_score[,1]
colnames(estimate_score) <- estimate_score[2,]
estimate_score <- estimate_score[-c(1:2),]
estimate_score[1:4,1:4]
读进来的被字符串化了:
Description StromalScore ImmuneScore ESTIMATEScore
AAACATACAACCAC "AAACATACAACCAC" "269.574325901855" "1068.73943272047" "1338.31375862232"
AAACATTGAGCTAC "AAACATTGAGCTAC" "157.640476166265" "1245.7269262377" "1403.36740240396"
AAACATTGATCAGC "AAACATTGATCAGC" "399.842094329677" "1208.89031399533" "1608.732408325"
AAACCGTGCTTCCG "AAACCGTGCTTCCG" "595.490469453362" "1418.44175293577" "2013.93222238914"
所以AddMetaData
要处理一下:
pbmc<-AddMetaData(pbmc,metadata = estimate_score,col.name = colnames(df))
pbmc@meta.data$StromalScore <- as.numeric(as.vector(pbmc@meta.data$StromalScore))
pbmc@meta.data$ImmuneScore <- as.numeric(as.vector(pbmc@meta.data$ImmuneScore))
pbmc@meta.data$TumorPurity <- as.numeric(as.vector(pbmc@meta.data$TumorPurity))
pbmc@meta.data$ESTIMATEScore <- as.numeric(as.vector(pbmc@meta.data$ESTIMATEScore))
可以看到给每个细胞都打了分,bulk的一个样本是一个组织,可以用肿瘤纯度,单个细胞也还是有的吗?所以,是不是可以用某一群的平均表达谱来做呢?其实看到这只是根据基因列表的打分机制,这也未尝不可。
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5 seurat_clusters Description
AAACATACAACCAC pbmc3k 2419 779 3.0177759 1 1 AAACATACAACCAC
AAACATTGAGCTAC pbmc3k 4903 1352 3.7935958 3 3 AAACATTGAGCTAC
AAACATTGATCAGC pbmc3k 3147 1129 0.8897363 1 1 AAACATTGATCAGC
AAACCGTGCTTCCG pbmc3k 2639 960 1.7430845 2 2 AAACCGTGCTTCCG
AAACCGTGTATGCG pbmc3k 980 521 1.2244898 6 6 AAACCGTGTATGCG
AAACGCACTGGTAC pbmc3k 2163 781 1.6643551 1 1 AAACGCACTGGTAC
StromalScore ImmuneScore ESTIMATEScore TumorPurity
AAACATACAACCAC 269.5743 1068.739 1338.314 0.6956758
AAACATTGAGCTAC 157.6405 1245.727 1403.367 0.6887845
AAACATTGATCAGC 399.8421 1208.890 1608.732 0.6666206
AAACCGTGCTTCCG 595.4905 1418.442 2013.932 0.6211327
AAACCGTGTATGCG 540.3476 1019.445 1559.792 0.6719582
AAACGCACTGGTAC 366.6339 1103.304 1469.938 0.6816675
可视化一下,看看不同分群下,"StromalScore" "ImmuneScore" "ESTIMATEScore" "TumorPurity" 四个分数的比例:
p1<- RidgePlot(pbmc,features = "StromalScore")
p2<- RidgePlot(pbmc,features = "ImmuneScore")
p3<- RidgePlot(pbmc,features = "TumorPurity")
p4<- RidgePlot(pbmc,features = "ESTIMATEScore")
# p4<- Seurat::CombinePlots(c(p1 ,p2,p3,p4))
library(gridExtra)
grid.arrange(p1 ,p2,p3,p4,ncol = 2 )
就TumorPurity 来绘制熟悉的箱型图:
ggplot(pbmc@meta.data, aes(x=RNA_snn_res.0.5, y=TumorPurity,fill=RNA_snn_res.0.5)) +
geom_boxplot(notch=TRUE)
严格的话,可以做一下显著性检验啊,这里我们就不过了。我比较好奇的是这四个是什么关系?为什么还有TumorPurity ,ESTIMATEScore?
library(GGally)
ggpairs(pbmc@meta.data[,c("StromalScore","ImmuneScore","TumorPurity","ESTIMATEScore","seurat_clusters")],
mapping = aes(color = seurat_clusters))
可以看到TumorPurity ,ESTIMATEScore是负相关的关系,其实知道一个就可以了。结合这些可视化的结果可以为我们了解哪些群的肿瘤纯度如何,从这个侧面来解释细胞的异质性。
那么有没有其他软件呢?显然是有的,一个可以用的就是Xcell了:https://github.com/dviraran/xCell
xCell is a gene signatures-based method
learned from thousands of pure cell types from various sources. xCell applies a novel technique for reducing associations between closely related cell types. xCell signatures were validated using extensive in-silico simulations and also cytometry immunophenotyping
, and were shown to outperform previous methods. xCell allows researchers to reliably portray the cellular heterogeneity landscape of tissue expression profiles.
还有一个:https://cibersortx.stanford.edu/,需要注册。
官网:
https://bioinformatics.mdanderson.org/public-software/estimate/