单细胞数据分析笔记6: 转录因子分析（SCENIC）

转录因子(transcription factors, TFs)是指能够识别、结合在某基因上游特异核苷酸序列上的蛋白质。其通过介导RNA聚合酶与DNA模板的结合，从而调控下游靶基因的转录；也可以和其它转录因子形成转录因子复合体来影响基因的转录。转录因子在生命活动过程中,参与调节基因组DNA开放性、募集RNA聚合酶进行转录过程、募集辅助因子调节特定的转录阶段,调控生命体的免疫、发育等进程。
转录因子特异识别、结合的DNA序列称为转录因子结合位点(Transcription factor binding site,TFBS)，长度在5~20bp范围。由于同一个TF可以调控多个，而其具体结合到每个靶基因的TFBS不完全相同，但具有一定的保守性。

1. $\color{green}{SCENIC}$

SCENIC(Single-Cell Regulatory Network Inference and Clustering)分析是对ScRNA-Seq数据中转录因子（Transcription Factors，TFs）进行研究，最终筛选得到调控强度显著、处于核心作用的TFs，从而为探究其发病机制奠定基础。目前，SCENIC已改为pySCENIC

SCENIC流程包括三步骤:
(1)使用GENIE3或GRNBoost(Gradient Boosting)基于共表达，推断转录因子与候选靶基因之间的共表达模块。

(2)由于GENIE3模型只是基于共表达，会存在很多假阳性和间接靶标，为了识别直接结合靶标(direct-binding targets)，使用RcisTarget对每个共表达模块进行顺式调控基序(cis-regulatory motif）分析。进行TF-motif富集分析，识别直接靶标。(仅保留具有正确的上游调节子且显著富集的motif modules，并对它们进行修剪以除去缺乏motif支持的间接靶标。)这些处理后的每个TF及其潜在的直接targets gene被称作一个调节子(regulon);

(3）使用AUCell算法对每个细胞的每个regulon活性进行打分。对于一个regulon来说，比较细胞间的AUCell得分可以鉴定出哪种细胞有显著更高的subnetwork活性。结果生成一个二进制的regulon活性矩阵(binarized activity matrix)，这将确定Regulon在哪些细胞中处于"打开"状态

1.1 下载所需参考数据集（linux端）

$ cd /data/shumin/data/RcisTarget 

# （1）motif–gene注释数据(genome ranking database): https://resources.aertslab.org/cistarget/databases/
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &

# （2）motif–TF注释数据: https://resources.aertslab.org/cistarget/motif2tf/
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl

# （3）转录因子列表: https://resources.aertslab.org/cistarget/tf_lists/
wget https://resources.aertslab.org/cistarget/tf_lists/allTFs_hg38.txt

1.2 数据预处理

# 安装Bioconductor和相关包
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install(c("SCENIC", "GENIE3", "RcisTarget"))

# 安装其他依赖包
install.packages(c("Matrix", "data.table", "igraph", "pheatmap", "ggplot2"))

# 加载所需的R包
library(SCENIC)
library(GENIE3)
library(RcisTarget)
library(Matrix)
library(data.table)
library(igraph)
library(pheatmap)
library(ggplot2)

# 加载表达数据矩阵（假设数据已经是一个数据框）
load("sub_Epithelial_0903.Rdata")
expr <- sub_Epithelial@assays[["RNA"]]@counts
expr_matrix <- as.matrix(expr)

# 提取细胞信息/表型数据
cellInfo <- data.frame(sub_Epithelial@meta.data)
cellInfo <- cellInfo[,c("Sample","RNA_snn_res.0.6","Group","subtype")]
head(cellInfo)
##                   Sample RNA_snn_res.0.6  Group             subtype
##  AGGAGGTATGTAGTCTC    N11              13 Normal mucous acinar cells
##  TGCACGGATGGACCATA    N11               5 Normal serous acinar cells
##  CCCTTAGATGCACGTCT    N11               6 Normal serous acinar cells
##  GTGGGAACGATTGCAGT    N11               5 Normal serous acinar cells
##  ACTACGATACTGCAGCG    N11               8 Normal        ductal cells
##  GACATCACGATACTCAG    N11               6 Normal serous acinar cells

dir.create("int")
saveRDS(cellInfo, file="int/cellInfo.Rds")

# 设置SCENIC分析的参数
mydbDIR <- "/data/shumin/data/RcisTarget"
mydbs <- c("hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather",
           "hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather")
names(mydbs) <- c("500bp", "10kb")

scenicOptions <- initializeScenic(org = "hgnc", 
                                  nCores = 15,
                                  dbDir = mydbDIR, 
                                  dbs = mydbs,
                                  datasetTitle = "pSS")

saveRDS(scenicOptions, "int/scenicOptions.Rds")

1.3 计算共表达网络

SCENIC正式分析的第一步是计算转录因子与每个基因的相关性

## 基因过滤
# 过滤标准:基因表达量之和 > 细胞数*3%，且在1%的细胞中表达
genesKept <- geneFiltering(expr_matrix, scenicOptions = scenicOptions,
                           minCountsPerGene = 3*.01*ncol(expr_matrix),
                           minSamples = ncol(expr_matrix)*.01)
## Maximum value in the expression matrix: 61516
## Ratio of detected vs non-detected: 0.034
## Number of counts (in the dataset units) per gene:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0        1       27     6130      670 76233910 
## Number of cells in which each gene is detected:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     1.0    25.0   918.5   563.0 25600.0 
## 
## Number of genes left after applying the following filters (sequential):
##  8444    genes with counts per gene > 836.88
##  8348    genes detected in more than 278.96 cells
## Using the column 'features' as feature index for the ranking database.
##  7226    genes available in RcisTarget database
## Gene list saved in int/1.1_genesKept.Rds

exprMat_filtered <- expr_matrix[genesKept, ]
dim(exprMat_filtered)
## [1]  7864 27896

# 计算相关性矩阵
runCorrelation(exprMat_filtered, scenicOptions)

# TF-Targets相关性回归分析
exprMat_filtered_log <- log2(exprMat_filtered + 1)

data(list = "motifAnnotations_hgnc_v9", package = "RcisTarget")
motifAnnotations_hgnc <- motifAnnotations_hgnc_v9

runGenie3(exprMat_filtered_log, scenicOptions) # 这一步消耗的计算资源非常大，需要较长时间
## Using 689 TFs as potential regulators...
## Running GENIE3 part 1
## Running GENIE3 part 2
## Running GENIE3 part 3
## Running GENIE3 part 4
## Running GENIE3 part 5
## Running GENIE3 part 6
## Running GENIE3 part 7
## Running GENIE3 part 8
## Running GENIE3 part 9
## Running GENIE3 part 10
## Running GENIE3 part 11
## Running GENIE3 part 12
## Running GENIE3 part 13
## Running GENIE3 part 14
## Running GENIE3 part 15
## Running GENIE3 part 16
## Running GENIE3 part 17
## Running GENIE3 part 18
## Running GENIE3 part 19
## Running GENIE3 part 20
## Finished running GENIE3.
## Warning message:
## In runGenie3(exprMat_filtered, scenicOptions, nParts = 20) :
##   Only 37% of the 1839 TFs in the database were found in the dataset. Do they use the same gene IDs?

以上代码运行后，int目录下生成相关不少中间结果：
ꔷ 1.2_corrMat.Rds：基因之间的相关性矩阵
ꔷ 1.3_GENIE3_weightMatrix_part_1.Rds等：GENIE3的中间结果
ꔷ 1.4_GENIE3_linkList.Rds：GENIE3最终结果，是把“1.3_”开头的文件合并在一起

1.4 推断共表达模块

接下来需要过滤低相关性的组合形成共表达基因集（模块）。SCENIC提供了多种策略过滤低相关性TF-Target，建议是6种过滤标准都用
ꔷ w001：以每个TF为核心保留weight>0.001的基因形成共表达模块；
ꔷ w005：以每个TF为核心保留weight>0.005的基因形成共表达模块；
ꔷ top50：以每个TF为核心保留weight值top50的基因形成共表达模块；
ꔷ top5perTarget：每个基因保留weight值top5的TF得到精简的TF-Target关联表，然后把基因分配给TF构建共表达模块；
ꔷ top10perTarget：每个基因保留weight值top10的TF得到精简的TF-Target关联表，然后把基因分配给TF构建共表达模块；
ꔷ top50perTarget：每个基因保留weight值top50的TF得到精简的TF-Target关联表，然后把基因分配给TF构建共表达模块；

# 推断共表达模块（主要运行结果是int目录下的1.6_tfModules_asDF.Rds）
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
## 15:50    Creating TF modules
##         75%         90% 
## 0.002407983 0.004071417 
## Number of links between TFs and targets: 2551778
##              [,1]
## nTFs          689
## nTargets     7864
## nGeneSets    2803
## nLinks    3275940

经过上述分析每个转录因子都找到了强相关的靶基因，接下来过滤共表达模块形成有生物学意义的调控单元（regulons）：
ꔷ 对每个共表达模块进行motif富集分析，保留显著富集的motif（此项分析依赖gene-motif评分（排行）数据库，其行为基因、列为motif、值为排名，就是下载的cisTarget数据库）
ꔷ 使用数据库对motif进行TF注释，注释结果分高、低可信度。数据库直接注释和同源基因推断的TF是高可信结果，使用motif序列相似性注释的TF是低可信结果。
ꔷ 用保留的motif对共表达模块内的基因进行打分（同样依据cisTarget数据库），识别显著高分的基因（理解为motif离这些基因的TSS很近）；
ꔷ 删除共表达模块内与motif评分不高的基因，剩下的基因集称为调控单元（regulon）

# 推断转录调控网络（regulon）
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget"))  #** Only for toy run!!
## 15:52    Step 2. Identifying regulons
## tfModulesSummary:
## top5perTarget 
##           107 
## 15:52    RcisTarget: Calculating AUC
## Using the column 'features' as feature index for the ranking database.
## Scoring database:  [Source file: hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather]
## Using the column 'features' as feature index for the ranking database.
## Scoring database:  [Source file: hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather]
## 15:53    RcisTarget: Adding motif annotation
## Using BiocParallel...
## Using BiocParallel...
## Number of motifs in the initial enrichment: 53899
## Number of motifs annotated to the matching TF: 1218
## 15:53    RcisTarget: Prunning targets
## Using the column 'features' as feature index for the ranking database.
## Using the column 'features' as feature index for the ranking database.
## 15:56    Number of motifs that support the regulons: 1218
##  Preview of motif enrichment saved as: output/Step2_MotifEnrichment_preview.html
## Warning message:
## In RcisTarget::addLogo(tableSubset, motifCol = motifCol, dbVersion = dbVersion,  :
##   There is no annotation version attribute in the input table (it has probably been loaded with an older version of the package).'v9' will be used as it ## was the old default,but we recommend to re-load the annotations and/or re-run the enrichment to make sure everything is consistent.

★ 重要的结果保存在output文件夹：
1）Step2_MotifEnrichment.tsv：各个共表达模块显著富集motif的注释信息

SCENIC-1

2）Step2_MotifEnrichment_preview.html

SCENIC-2

3）Step2_regulonTargetsInfo.tsv：最终的regulon文件（后续分析找到有价值的regulon，需要回到这个文件找对应的转录因子和靶基因）

SCENIC-3

每个Regulon就是一个转录因子及其直接调控靶基因的基因集，接下来的工作就是对每个regulon在各个细胞中的活性评分。评分的基础是基因的表达值，分数越高代表基因集的激活程度越高

# 计算细胞网络评分
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_filtered_log)
## 17:16    Step 3. Analyzing the network activity in each individual cell
##  Number of regulons to evaluate on cells: 40
## Biggest (non-extended) regulons: 
##   XBP1 (267g)
##   NFKB1 (96g)
##   ELF1 (87g)
##   FOSL1 (81g)
##   FOS (79g)
##   TEAD1 (61g)
##   SPDEF (56g)
##   CREB3L4 (37g)
##   EGR1 (33g)
##   IRF1 (31g)
## Quantiles for the number of genes detected by cell: 
## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
##  min   1%   5%  10%  50% 100% 
##   59  208  303  383  944 4139 
## Using 15 cores.
## Warning in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores,  :
##   Using only the first 208 genes (aucMaxRank) to calculate the AUC.
## Using 15 cores with doMC.
## 17:20    Finished running AUCell.
## 17:20    Plotting heatmap...
## 17:20    Plotting t-SNEs...

saveRDS(scenicOptions, file="int/scenicOptions.Rds") # To save status

★ runSCENIC_3_scoreCells()是一个多种功能打包的函数，它包含的子功能有：
1）计算regulon在每个细胞中AUC值，最后得到一个以regulon为行细胞为列的矩阵。结果：int/3.4_regulonAUC.Rds
2）计算每个regulon的AUC阈值，细胞中regulonAUC值>阈值，代表此regulon在细胞中处于激活状态，否则代表沉默状态。结果：int/3.5_AUCellThresholds.Rds
3）使用regulonAUC矩阵对细胞进行降维聚类
4）用heatmap图展示regulonAUC矩阵，用t-SNE图分别展示每个regulon的活性分布情况。结果：output/Step3_开头的系列文件

二进制转换及衍生分析（将regulonAUC矩阵转换为二进制矩阵后，会重新降维聚类，输出的结果与runSCENIC_3_scoreCells()类似）

scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
## Binary regulon activity: 28 TF regulons x 26877 cells.
## (40 regulons including 'extended' versions)
## 28 regulons are active in more than 1% (268.77) cells.

1.5 调整阈值（可选）

#使用shiny互动调整阈值
aucellApp <- plotTsne_AUCellApp(scenicOptions, exprMat_all)
savedSelections <- shiny::runApp(aucellApp)
#保存调整后的阈值
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")

1.6 SCENIC结果可视化

runSCENIC_3_scoreCells()和runSCENIC_4_aucell_binarize()运行之后会生成一些可视化的heatmap图与tSNE图，但不美观，不宜与seurat联系起来，故可重新自行做图

##导入原始regulonAUC矩阵
AUCmatrix <- readRDS("int/3.4_regulonAUC.Rds")
AUCmatrix <- AUCmatrix@assays@data@listData$AUC
AUCmatrix <- data.frame(t(AUCmatrix), check.names=F)
RegulonName_AUC <- colnames(AUCmatrix)
RegulonName_AUC <- gsub(' \\(','_',RegulonName_AUC)
RegulonName_AUC <- gsub('\\)','',RegulonName_AUC)
colnames(AUCmatrix) <- RegulonName_AUC
sub_Epithelialauc <- AddMetaData(sub_Epithelial, AUCmatrix)
sub_Epithelialauc@assays$integrated <- NULL
saveRDS(sub_Epithelialauc,'sub_Epithelialauc.rds')

##导入二进制regulonAUC矩阵
BINmatrix <- readRDS("int/4.1_binaryRegulonActivity.Rds")
BINmatrix <- data.frame(t(BINmatrix), check.names=F)
RegulonName_BIN <- colnames(BINmatrix)
RegulonName_BIN <- gsub(' \\(','_',RegulonName_BIN)
RegulonName_BIN <- gsub('\\)','',RegulonName_BIN)
colnames(BINmatrix) <- RegulonName_BIN
sub_Epithelialbin <- AddMetaData(sub_Epithelial, BINmatrix)
sub_Epithelialbin@assays$integrated <- NULL
saveRDS(sub_Epithelialbin, 'sub_Epithelialbin.rds')

##利用Seurat可视化AUC
dir.create('scenic_seurat')
#FeaturePlot
p1 = FeaturePlot(sub_Epithelialauc, features='CD59_extended_34g', label=T, reduction = 'umap')
p2 = FeaturePlot(sub_Epithelialbin, features='CD59_extended_34g', label=T, reduction = 'umap')
p3 = DimPlot(sub_Epithelial, reduction = 'umap', group.by = "subtype", label=T)
p1|p2|p3

SCENIC-4

# RidgePlot&VlnPlot
p1 = RidgePlot(sub_Epithelialauc, features = "CD59_extended_34g", group.by="subtype") + 
               theme(legend.position='none')
p2 = VlnPlot(sub_Epithelialauc, features = "CD59_extended_34g", pt.size = 0, group.by="subtype") + 
             theme(legend.position='none')
p1 + p2

SCENIC-5.png

library(pheatmap)
cellInfo <- readRDS("int/cellInfo.Rds")
subtype = subset(cellInfo,select = 'subtype')
AUCmatrix <- t(AUCmatrix)
BINmatrix <- t(BINmatrix)
#挑选部分感兴趣的regulons
my.regulons <- c('CD59_extended_34g','CEBPD_extended_14g','CHD2_extended_668g','TEAD1_extended_68g','TEAD1_61g',
                 'JUNB_extended_24g','CREB3L4_37g','SPDEF_56g','CREB3L4_extended_124g','IRF1_31g','IRF1_extended_40g')
myAUCmatrix <- AUCmatrix[rownames(AUCmatrix)%in%my.regulons,]
myBINmatrix <- BINmatrix[rownames(BINmatrix)%in%my.regulons,]
#使用regulon原始AUC值绘制热图
pheatmap(myAUCmatrix, show_colnames=F, annotation_col=subtype)
#使用regulon二进制AUC值绘制热图
pheatmap(myBINmatrix, show_colnames=F, annotation_col=subtype,
         color = colorRampPalette(colors = c("white","black"))(100))

SCENIC-6

参考教程：

单细胞转录组高级分析二：转录调控网络分析（https://cloud.tencent.com/developer/article/1692240）

单细胞数据分析笔记6: 转录因子分析（SCENIC）