SCENIC转录因子分析实战---以Seurat的pbmc3K数据集为例

下面的代码复制粘贴即可运行，超级简单，如果是你自己的数据，你只需同样的模式做出来 exprMat 表达矩阵，和cellInfo的临床表型，就可以走这个SCENIC流程的4个步骤啦。

rm(list = ls()) 
library(Seurat) 
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
AvailableData()
# InstallData("pbmc3k") #  (89.4 MB) 
data("pbmc3k") 
exprMat  <-  as.matrix(pbmc3k@assays$RNA@data)
dim(exprMat)
exprMat[1:4,1:4] 
cellInfo <-  pbmc3k@meta.data[,c(4,2,3)]
colnames(cellInfo)=c('CellType', 'nGene' ,'nUMI')
head(cellInfo)
table(cellInfo$CellType)

### Initialize settings
library(SCENIC)
# 保证cisTarget_databases 文件夹下面有下载好2个1G的文件
scenicOptions <- initializeScenic(org="hgnc", 
                                  dbDir="cisTarget_databases", nCores=1) 
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 

### Co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions)
exprMat_filtered <- exprMat[genesKept, ]
exprMat_filtered[1:4,1:4]
dim(exprMat_filtered)
runCorrelation(exprMat_filtered, scenicOptions)
exprMat_filtered_log <- log2(exprMat_filtered+1) 
runGenie3(exprMat_filtered_log, scenicOptions)

### Build and score the GRN
exprMat_log <- log2(exprMat+1)
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions,
                                            coexMethod=c("top5perTarget")) # Toy run settings
library(doParallel)
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log ) 
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
tsneAUC(scenicOptions, aucType="AUC") # choose settings`

因为我们这个是实战案例，表达矩阵很大，接近3000个细胞，是全部的人类基因，所以耗费了一个晚上才完成这个流程，运行的log日志如下：

`> runGenie3(exprMat_filtered_log, scenicOptions)
Using 480 TFs as potential regulators...
Running GENIE3 part 1
Running GENIE3 part 10
Running GENIE3 part 2
Running GENIE3 part 3
Running GENIE3 part 4
Running GENIE3 part 5
Running GENIE3 part 6
Running GENIE3 part 7
Running GENIE3 part 8
Running GENIE3 part 9
Finished running GENIE3.
> 
> ### Build and score the GRN
> exprMat_log <- log2(exprMat+1)
> scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings
> scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
07:33 Creating TF modules
Number of links between TFs and targets (weight>=0.001): 1773984
             [,1]
nTFs          480
nTargets     5318
nGeneSets    3835
nLinks    2516422
> scenicOptions <- runSCENIC_2_createRegulons(scenicOptions,
+                                             coexMethod=c("top5perTarget")) # Toy run settings
07:34 Step 2. Identifying regulons
tfModulesSummary:
              [,1]
top5perTarget   59
07:34 RcisTarget: Calculating AUC
Scoring database:  [Source file: hg19-tss-centered-10kb-7species.mc9nr.feather]
07:35 RcisTarget: Adding motif annotation
Using BiocParallel...
Number of motifs in the initial enrichment: 18696
Number of motifs annotated to the matching TF: 334
07:35 RcisTarget: Prunning targets
07:37 Number of motifs that support the regulons: 334
 Preview of motif enrichment saved as: output/Step2_MotifEnrichment_preview.html
> library(doParallel)
Loading required package: iterators
> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log ) 
07:37 Step 3. Analyzing the network activity in each individual cell
 Number of regulons to evaluate on cells: 20
Biggest (non-extended) regulons: 
  SPI1 (538g)
  CEBPB (47g)
  CEBPD (41g)
  SPIB (39g)
  TBX21 (27g)
  IRF8 (17g)
  STAT1 (14g)
  IRF7 (12g)
Quantiles for the number of genes detected by cell: 
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
    min      1%      5%     10%     50%    100% 
 212.00  325.00  434.95  539.90  816.00 3400.00 
07:38 Finished running AUCell.
07:38 Plotting heatmap...
07:38 Plotting t-SNEs...
> scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
Binary regulon activity: 11 TF regulons x 1678 cells.
(19 regulons including 'extended' versions)
11 regulons are active in more than 1% (16.78) cells.
> tsneAUC(scenicOptions, aucType="AUC") # choose settings
[1] "int/tSNE_AUC_50pcs_30perpl.Rds"` </pre>

可以看到，对于我们这个真实数据，就是PBMC3K的，也只有19个regulon被挑选出来了，涉及到11个TF基因

作者推荐的运算结果保存是：

export2loom(scenicOptions, exprMat)
saveRDS(scenicOptions, file="int/scenicOptions.Rds")` </pre>

实际上我们也用不上哈！

输出结果的解读

首先看看转录因子富集结果：

rm(list = ls()) 
library(Seurat) 
library(SCENIC)
library(doParallel)

scenicOptions=readRDS(file="int/scenicOptions.Rds")

### Exploring output 
# Check files in folder 'output'
# Browse the output .loom file @ http://scope.aertslab.org

# output/Step2_MotifEnrichment_preview.html in detail/subset:
motifEnrichment_selfMotifs_wGenes <- loadInt(scenicOptions, "motifEnrichment_selfMotifs_wGenes") 
as.data.frame(sort(table(motifEnrichment_selfMotifs_wGenes$highlightedTFs),decreasing = T))` </pre>

每个基因的motif数量：

> as.data.frame(sort(table(motifEnrichment_selfMotifs_wGenes$highlightedTFs),decreasing = T))
      Var1 Freq
1     SPI1   61
2     IRF7   59
3    TBX21   47
4    STAT1   33
5     SPIB   27
6     MAFB   26
7     IRF8   23
8    CEBPD   21
9      FOS   10
10 POU2AF1    5
11   CEBPB    3
12    TCF7    3
13     MAX    2
14    FOSB    1
15    LEF1    1
16    LYL1    1` </pre>

可视化IRF7基因的motif序列特征：

tableSubset <- motifEnrichment_selfMotifs_wGenes[highlightedTFs=="IRF7"]
viewMotifs(tableSubset)` </pre>

这个时候的IRF7基因有 56 个motif，如下所示：

图片

如果加上活性单元（regulon）的限定后：

regulonTargetsInfo <- loadInt(scenicOptions, "regulonTargetsInfo")
tableSubset <- regulonTargetsInfo[TF=="IRF7" & highConfAnnot==TRUE]
viewMotifs(tableSubset)` </pre>

就只有12个啦，不过我们需要的并不是这些结果啦。

如果要理解（regulon），需要看我分享的：使用AUCell包的AUCell_calcAUC函数计算每个细胞的每个基因集的活性程度

可以使用的结果：

其中output文件夹本来就已经自动绘制了大量的图表供使用，而图表对应的数据就存储在 loomFile 里面，可以使用下面的代码重新获取：

rm(list = ls()) 
library(Seurat) 
library(SCENIC)
library(doParallel)
library(SCopeLoomR)
scenicOptions=readRDS(file="int/scenicOptions.Rds")
scenicLoomPath <- getOutName(scenicOptions, "loomFile")
loom <- open_loom(scenicLoomPath)
# Read information from loom file:
regulons_incidMat <- get_regulons(loom)
regulons <- regulonsToGeneLists(regulons_incidMat)
regulonsAUC <- get_regulons_AUC(loom)
regulonsAucThresholds <- get_regulon_thresholds(loom)
embeddings <- get_embeddings(loom)` </pre>

可以可视化其中一些TF的AUC值，也可以根据这些TF的AUC值对细胞亚群进行重新降维聚类分群。可以很容易看到血液里面的不同细胞亚群的特异性的转录调控因子：

图片

如果使用这些转录调控因子进行降维聚类分群，可以得到：

图片

原文：单细胞转录因子分析之SCENIC流程