Single-sample gene set enrichment analysis --- ssGSEA

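1. Overview

Single-sample gene set enrichment analysis (ssGSEA) is an extension of the GSEA method, designed mainly for the case where GSEA cannot be run because only a single sample is available. The method was published in Nature in 2009, in the paper "Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1".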

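2. Algorithm

In short, the algorithm first rank-normalizes the gene expression values of a given sample, then computes the enrichment score (ES) from empirical cumulative distribution functions.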
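The two steps can be sketched in a few lines of base R. This is only my reading of the algorithm, not the code GSVA runs internally: ssgsea_es is a hypothetical helper, and the weighting exponent alpha = 0.25 is an assumption based on my recollection of the value used in Barbie et al. (2009).

```r
# Hypothetical helper (not part of GSVA): ssGSEA enrichment score for one sample.
# expr:     named numeric vector of expression values (names = gene identifiers)
# gene_set: character vector of gene identifiers
# alpha:    weighting exponent (assumed to be 0.25)
ssgsea_es <- function(expr, gene_set, alpha = 0.25) {
  # Rank normalization: lowest expression gets rank 1, highest gets rank N
  r <- rank(expr, ties.method = "average")
  # Walk through the genes from high to low expression
  r <- r[order(r, decreasing = TRUE)]
  in_set <- names(r) %in% gene_set
  n_out <- sum(!in_set)
  stopifnot(any(in_set), n_out > 0)
  # Rank-weighted ECDF of the genes inside the set
  p_in <- cumsum(ifelse(in_set, r^alpha, 0)) / sum(r[in_set]^alpha)
  # Unweighted ECDF of the genes outside the set
  p_out <- cumsum(!in_set) / n_out
  # ES is the sum of the differences over all positions
  sum(p_in - p_out)
}

# Toy sample with made-up gene names and values
expr <- c(A = 10, B = 8, C = 6, D = 4, E = 2, F = 1)
es_top    <- ssgsea_es(expr, c("A", "B"))  # set made of the highest expressed genes
es_bottom <- ssgsea_es(expr, c("E", "F"))  # set made of the lowest expressed genes
```

A set concentrated at the top of the ranking gets a positive score and a set concentrated at the bottom gets a negative one, which is the behaviour one would expect from an enrichment score.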


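Let G be the given gene set, containing N_G genes, and let S be a single sample whose expression profile contains N genes. The N genes are ranked by their absolute expression values from high to low. With i running from 1 to N, P_G^w and P_{N_G} are computed at each position.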
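For reference, the two quantities and the resulting score can be written out as follows. This is reconstructed from my reading of Barbie et al. (2009) and should be checked against the paper: r_j denotes the rank of the j-th gene in the ordered list, and alpha is the weighting exponent, which I recall being set to 1/4 there (treat that value as an assumption).

```latex
\begin{align}
P_G^w(G, S, i)   &= \sum_{r_j \in G,\; j \le i} \frac{|r_j|^{\alpha}}{\sum_{r_j \in G} |r_j|^{\alpha}} \\
P_{N_G}(G, S, i) &= \sum_{r_j \notin G,\; j \le i} \frac{1}{N - N_G} \\
ES(G, S)         &= \sum_{i=1}^{N} \left[ P_G^w(G, S, i) - P_{N_G}(G, S, i) \right]
\end{align}
```

Unlike classic GSEA, which takes the maximum deviation of the running sum, the score here is the sum of the differences between the two cumulative distributions over all N positions.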

3. Implementation

The R package GSVA can run ssGSEA. GSVA is distributed through Bioconductor and can be installed with the following code:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GSVA")

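The function we use is gsva(), which is dispatched through S4 methods. The supported signatures are 'ExpressionSet,list', 'ExpressionSet,GeneSetCollection', 'matrix,GeneSetCollection' and 'matrix,list'.
In other words, the first argument can be an ExpressionSet object or an ordinary matrix, and the second argument can be an ordinary list or a GeneSetCollection object. The method argument must be set to "ssgsea"; verbose can be set to TRUE according to personal preference; the remaining arguments can be left at their defaults. Among them, ssgsea.norm normalizes the ssGSEA scores by the absolute difference between the maximum and the minimum. The relevant excerpt from the package documentation reads: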

--- ssgsea.norm
Logical, set to TRUE (default) with method="ssgsea" runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea.norm=FALSE this last normalization step is skipped.
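To make the normalization step concrete, here is a minimal sketch of what the excerpt describes. It assumes that the minimum and maximum are taken over the whole score matrix (all gene sets and all samples together); normalize_ssgsea is a hypothetical helper written for illustration, not a GSVA function, and the scores are made up.

```r
# Hypothetical helper: divide raw scores by the spread between the
# smallest and the largest score in the whole matrix (assumed behaviour).
normalize_ssgsea <- function(es) {
  es / (max(es) - min(es))
}

# Made-up raw scores: 2 gene sets (rows) x 2 samples (columns)
raw <- matrix(c(-2, 0, 1, 2), nrow = 2,
              dimnames = list(c("SetA", "SetB"), c("S1", "S2")))
norm <- normalize_ssgsea(raw)
```

After this step the scores span a range of exactly 1, while their signs and relative order are unchanged.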

My own example code:

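library(GSVA)
# Gene sets: one column per set, header row = set name
gs <- read.csv("geneset.csv", stringsAsFactors = FALSE, check.names = FALSE)
# Expression profile: rows = genes, columns = samples
a <- read.table("RNA.csv", stringsAsFactors = FALSE, header = TRUE, row.names = 1, sep = ",")
a <- as.matrix(a)
gs <- as.list(gs)
# Drop the padding left by columns of unequal length (blank cells are read as NA or "")
gs <- lapply(gs, function(x) x[!is.na(x) & x != ""])
# Classic interface, signature 'matrix,list'. Newer GSVA releases (1.50 and later, as far as
# I know) expect a parameter object instead: gsva(ssgseaParam(a, gs, normalize = TRUE))
ssgsea_score <- gsva(a, gs, method = "ssgsea", ssgsea.norm = TRUE, verbose = TRUE)
write.csv(ssgsea_score, "ssGSEA.csv")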

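In geneset.csv, the first row of each column holds the name of a gene set, and the genes belonging to that set are listed below the name. RNA.csv is an expression profile with one gene per row and one sample per column. Running the script yields the ssGSEA enrichment score of every gene set in every sample.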
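The two file layouts can be illustrated with a toy example that reads inline text instead of files. All set names, gene names and values below are made up. The filter drops both NA and empty strings, because blank cells in a text column are read as empty strings rather than NA.

```r
# Toy stand-in for geneset.csv: header = set names, genes listed underneath.
# SetB is shorter than SetA, so its last cell is blank.
geneset_text <- "SetA,SetB
TP53,KRAS
EGFR,TBK1
MYC,"
gs <- read.csv(text = geneset_text, stringsAsFactors = FALSE, check.names = FALSE)
# Turn the columns into a named list and drop the blank padding
gs <- lapply(as.list(gs), function(x) x[!is.na(x) & x != ""])

# Toy stand-in for RNA.csv: first column = gene names, one column per sample
rna_text <- "gene,S1,S2
TP53,5.1,3.2
EGFR,2.0,7.4
MYC,8.3,1.1
KRAS,4.4,6.0
TBK1,0.9,2.5"
a <- as.matrix(read.csv(text = rna_text, row.names = 1, check.names = FALSE))
```

Here gs is a named list with one character vector per gene set, and a is a numeric matrix with genes as row names and samples as column names, which is the 'matrix,list' combination passed to gsva().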

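In addition, the GenePattern analysis platform developed by the Broad Institute can run ssGSEA online; readers who are not familiar with R can explore it on their own. https://www.genepattern.org/modules/docs/ssGSEAProjection/4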