转录因子(transcription factors, TFs)是指能够识别、结合在某基因上游特异核苷酸序列上的蛋白质。其通过介导RNA聚合酶与DNA模板的结合,从而调控下游靶基因的转录;也可以和其它转录因子形成转录因子复合体来影响基因的转录。转录因子在生命活动过程中,参与调节基因组DNA开放性、募集RNA聚合酶进行转录过程、募集辅助因子调节特定的转录阶段,调控生命体的免疫、发育等进程。
转录因子特异识别、结合的DNA序列称为转录因子结合位点(Transcription factor binding site,TFBS),长度在5~20bp范围。由于同一个TF可以调控多个,而其具体结合到每个靶基因的TFBS不完全相同,但具有一定的保守性。
1.
SCENIC(Single-Cell Regulatory Network Inference and Clustering)分析是对ScRNA-Seq数据中转录因子(Transcription Factors,TFs)进行研究,最终筛选得到调控强度显著、处于核心作用的TFs,从而为探究其发病机制奠定基础。目前,SCENIC已改为pySCENIC
SCENIC流程包括三步骤:
(1)使用GENIE3或GRNBoost(Gradient Boosting)基于共表达,推断转录因子与候选靶基因之间的共表达模块。
(2)由于GENIE3模型只是基于共表达,会存在很多假阳性和间接靶标,为了识别直接结合靶标(direct-binding targets),使用RcisTarget对每个共表达模块进行顺式调控基序(cis-regulatory motif)分析。进行TF-motif富集分析,识别直接靶标。(仅保留具有正确的上游调节子且显著富集的motif modules,并对它们进行修剪以除去缺乏motif支持的间接靶标。)这些处理后的每个TF及其潜在的直接targets gene被称作一个调节子(regulon);
(3)使用AUCell算法对每个细胞的每个regulon活性进行打分。对于一个regulon来说,比较细胞间的AUCell得分可以鉴定出哪种细胞有显著更高的subnetwork活性。结果生成一个二进制的regulon活性矩阵(binarized activity matrix),这将确定Regulon在哪些细胞中处于"打开"状态
1.1 下载所需参考数据集(linux端)
$ cd /data/shumin/data/RcisTarget
# (1)motif–gene注释数据(genome ranking database): https://resources.aertslab.org/cistarget/databases/
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &
# (2)motif–TF注释数据: https://resources.aertslab.org/cistarget/motif2tf/
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
# (3)转录因子列表: https://resources.aertslab.org/cistarget/tf_lists/
wget https://resources.aertslab.org/cistarget/tf_lists/allTFs_hg38.txt
1.2 数据预处理
# 安装Bioconductor和相关包
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(c("SCENIC", "GENIE3", "RcisTarget"))
# 安装其他依赖包
install.packages(c("Matrix", "data.table", "igraph", "pheatmap", "ggplot2"))
# 加载所需的R包
library(SCENIC)
library(GENIE3)
library(RcisTarget)
library(Matrix)
library(data.table)
library(igraph)
library(pheatmap)
library(ggplot2)
# 加载表达数据矩阵(假设数据已经是一个数据框)
load("sub_Epithelial_0903.Rdata")
expr <- sub_Epithelial@assays[["RNA"]]@counts
expr_matrix <- as.matrix(expr)
# 提取细胞信息/表型数据
cellInfo <- data.frame(sub_Epithelial@meta.data)
cellInfo <- cellInfo[,c("Sample","RNA_snn_res.0.6","Group","subtype")]
head(cellInfo)
## Sample RNA_snn_res.0.6 Group subtype
## AGGAGGTATGTAGTCTC N11 13 Normal mucous acinar cells
## TGCACGGATGGACCATA N11 5 Normal serous acinar cells
## CCCTTAGATGCACGTCT N11 6 Normal serous acinar cells
## GTGGGAACGATTGCAGT N11 5 Normal serous acinar cells
## ACTACGATACTGCAGCG N11 8 Normal ductal cells
## GACATCACGATACTCAG N11 6 Normal serous acinar cells
dir.create("int")
saveRDS(cellInfo, file="int/cellInfo.Rds")
# 设置SCENIC分析的参数
mydbDIR <- "/data/shumin/data/RcisTarget"
mydbs <- c("hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather",
"hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather")
names(mydbs) <- c("500bp", "10kb")
scenicOptions <- initializeScenic(org = "hgnc",
nCores = 15,
dbDir = mydbDIR,
dbs = mydbs,
datasetTitle = "pSS")
saveRDS(scenicOptions, "int/scenicOptions.Rds")
1.3 计算共表达网络
SCENIC正式分析的第一步是计算转录因子与每个基因的相关性
## 基因过滤
# 过滤标准:基因表达量之和 > 细胞数*3%,且在1%的细胞中表达
genesKept <- geneFiltering(expr_matrix, scenicOptions = scenicOptions,
minCountsPerGene = 3*.01*ncol(expr_matrix),
minSamples = ncol(expr_matrix)*.01)
## Maximum value in the expression matrix: 61516
## Ratio of detected vs non-detected: 0.034
## Number of counts (in the dataset units) per gene:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1 27 6130 670 76233910
## Number of cells in which each gene is detected:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 25.0 918.5 563.0 25600.0
##
## Number of genes left after applying the following filters (sequential):
## 8444 genes with counts per gene > 836.88
## 8348 genes detected in more than 278.96 cells
## Using the column 'features' as feature index for the ranking database.
## 7226 genes available in RcisTarget database
## Gene list saved in int/1.1_genesKept.Rds
exprMat_filtered <- expr_matrix[genesKept, ]
dim(exprMat_filtered)
## [1] 7864 27896
# 计算相关性矩阵
runCorrelation(exprMat_filtered, scenicOptions)
# TF-Targets相关性回归分析
exprMat_filtered_log <- log2(exprMat_filtered + 1)
data(list = "motifAnnotations_hgnc_v9", package = "RcisTarget")
motifAnnotations_hgnc <- motifAnnotations_hgnc_v9
runGenie3(exprMat_filtered_log, scenicOptions) # 这一步消耗的计算资源非常大,需要较长时间
## Using 689 TFs as potential regulators...
## Running GENIE3 part 1
## Running GENIE3 part 2
## Running GENIE3 part 3
## Running GENIE3 part 4
## Running GENIE3 part 5
## Running GENIE3 part 6
## Running GENIE3 part 7
## Running GENIE3 part 8
## Running GENIE3 part 9
## Running GENIE3 part 10
## Running GENIE3 part 11
## Running GENIE3 part 12
## Running GENIE3 part 13
## Running GENIE3 part 14
## Running GENIE3 part 15
## Running GENIE3 part 16
## Running GENIE3 part 17
## Running GENIE3 part 18
## Running GENIE3 part 19
## Running GENIE3 part 20
## Finished running GENIE3.
## Warning message:
## In runGenie3(exprMat_filtered, scenicOptions, nParts = 20) :
## Only 37% of the 1839 TFs in the database were found in the dataset. Do they use the same gene IDs?
以上代码运行后,int目录下生成相关不少中间结果:
ꔷ 1.2_corrMat.Rds:基因之间的相关性矩阵
ꔷ 1.3_GENIE3_weightMatrix_part_1.Rds等:GENIE3的中间结果
ꔷ 1.4_GENIE3_linkList.Rds:GENIE3最终结果,是把“1.3_”开头的文件合并在一起
1.4 推断共表达模块
接下来需要过滤低相关性的组合形成共表达基因集(模块)。SCENIC提供了多种策略过滤低相关性TF-Target,建议是6种过滤标准都用
ꔷ w001:以每个TF为核心保留weight>0.001的基因形成共表达模块;
ꔷ w005:以每个TF为核心保留weight>0.005的基因形成共表达模块;
ꔷ top50:以每个TF为核心保留weight值top50的基因形成共表达模块;
ꔷ top5perTarget:每个基因保留weight值top5的TF得到精简的TF-Target关联表,然后把基因分配给TF构建共表达模块;
ꔷ top10perTarget:每个基因保留weight值top10的TF得到精简的TF-Target关联表,然后把基因分配给TF构建共表达模块;
ꔷ top50perTarget:每个基因保留weight值top50的TF得到精简的TF-Target关联表,然后把基因分配给TF构建共表达模块;
# 推断共表达模块(主要运行结果是int目录下的1.6_tfModules_asDF.Rds)
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
## 15:50 Creating TF modules
## 75% 90%
## 0.002407983 0.004071417
## Number of links between TFs and targets: 2551778
## [,1]
## nTFs 689
## nTargets 7864
## nGeneSets 2803
## nLinks 3275940
经过上述分析每个转录因子都找到了强相关的靶基因,接下来过滤共表达模块形成有生物学意义的调控单元(regulons):
ꔷ 对每个共表达模块进行motif富集分析,保留显著富集的motif(此项分析依赖gene-motif评分(排行)数据库,其行为基因、列为motif、值为排名,就是下载的cisTarget数据库)
ꔷ 使用数据库对motif进行TF注释,注释结果分高、低可信度 。数据库直接注释和同源基因推断的TF是高可信结果,使用motif序列相似性注释的TF是低可信结果。
ꔷ 用保留的motif对共表达模块内的基因进行打分(同样依据cisTarget数据库),识别显著高分的基因(理解为motif离这些基因的TSS很近);
ꔷ 删除共表达模块内与motif评分不高的基因,剩下的基因集称为调控单元(regulon)
# 推断转录调控网络(regulon)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) #** Only for toy run!!
## 15:52 Step 2. Identifying regulons
## tfModulesSummary:
## top5perTarget
## 107
## 15:52 RcisTarget: Calculating AUC
## Using the column 'features' as feature index for the ranking database.
## Scoring database: [Source file: hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather]
## Using the column 'features' as feature index for the ranking database.
## Scoring database: [Source file: hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather]
## 15:53 RcisTarget: Adding motif annotation
## Using BiocParallel...
## Using BiocParallel...
## Number of motifs in the initial enrichment: 53899
## Number of motifs annotated to the matching TF: 1218
## 15:53 RcisTarget: Prunning targets
## Using the column 'features' as feature index for the ranking database.
## Using the column 'features' as feature index for the ranking database.
## 15:56 Number of motifs that support the regulons: 1218
## Preview of motif enrichment saved as: output/Step2_MotifEnrichment_preview.html
## Warning message:
## In RcisTarget::addLogo(tableSubset, motifCol = motifCol, dbVersion = dbVersion, :
## There is no annotation version attribute in the input table (it has probably been loaded with an older version of the package).'v9' will be used as it ## was the old default,but we recommend to re-load the annotations and/or re-run the enrichment to make sure everything is consistent.
★ 重要的结果保存在output文件夹:
1)Step2_MotifEnrichment.tsv:各个共表达模块显著富集motif的注释信息
2)Step2_MotifEnrichment_preview.html
3)Step2_regulonTargetsInfo.tsv:最终的regulon文件(后续分析找到有价值的regulon,需要回到这个文件找对应的转录因子和靶基因)
每个Regulon就是一个转录因子及其直接调控靶基因的基因集,接下来的工作就是对每个regulon在各个细胞中的活性评分。评分的基础是基因的表达值,分数越高代表基因集的激活程度越高
# 计算细胞网络评分
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_filtered_log)
## 17:16 Step 3. Analyzing the network activity in each individual cell
## Number of regulons to evaluate on cells: 40
## Biggest (non-extended) regulons:
## XBP1 (267g)
## NFKB1 (96g)
## ELF1 (87g)
## FOSL1 (81g)
## FOS (79g)
## TEAD1 (61g)
## SPDEF (56g)
## CREB3L4 (37g)
## EGR1 (33g)
## IRF1 (31g)
## Quantiles for the number of genes detected by cell:
## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
## min 1% 5% 10% 50% 100%
## 59 208 303 383 944 4139
## Using 15 cores.
## Warning in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, :
## Using only the first 208 genes (aucMaxRank) to calculate the AUC.
## Using 15 cores with doMC.
## 17:20 Finished running AUCell.
## 17:20 Plotting heatmap...
## 17:20 Plotting t-SNEs...
saveRDS(scenicOptions, file="int/scenicOptions.Rds") # To save status
★ runSCENIC_3_scoreCells()是一个多种功能打包的函数,它包含的子功能有:
1)计算regulon在每个细胞中AUC值,最后得到一个以regulon为行细胞为列的矩阵。结果:int/3.4_regulonAUC.Rds
2)计算每个regulon的AUC阈值,细胞中regulonAUC值>阈值,代表此regulon在细胞中处于激活状态,否则代表沉默状态。结果:int/3.5_AUCellThresholds.Rds
3)使用regulonAUC矩阵对细胞进行降维聚类
4)用heatmap图展示regulonAUC矩阵,用t-SNE图分别展示每个regulon的活性分布情况。结果:output/Step3_开头的系列文件
二进制转换及衍生分析(将regulonAUC矩阵转换为二进制矩阵后,会重新降维聚类,输出的结果与runSCENIC_3_scoreCells()类似)
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
## Binary regulon activity: 28 TF regulons x 26877 cells.
## (40 regulons including 'extended' versions)
## 28 regulons are active in more than 1% (268.77) cells.
1.5 调整阈值(可选)
#使用shiny互动调整阈值
aucellApp <- plotTsne_AUCellApp(scenicOptions, exprMat_all)
savedSelections <- shiny::runApp(aucellApp)
#保存调整后的阈值
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
1.6 SCENIC结果可视化
runSCENIC_3_scoreCells()和runSCENIC_4_aucell_binarize()运行之后会生成一些可视化的heatmap图与tSNE图,但不美观,不宜与seurat联系起来,故可重新自行做图
##导入原始regulonAUC矩阵
AUCmatrix <- readRDS("int/3.4_regulonAUC.Rds")
AUCmatrix <- AUCmatrix@assays@data@listData$AUC
AUCmatrix <- data.frame(t(AUCmatrix), check.names=F)
RegulonName_AUC <- colnames(AUCmatrix)
RegulonName_AUC <- gsub(' \\(','_',RegulonName_AUC)
RegulonName_AUC <- gsub('\\)','',RegulonName_AUC)
colnames(AUCmatrix) <- RegulonName_AUC
sub_Epithelialauc <- AddMetaData(sub_Epithelial, AUCmatrix)
sub_Epithelialauc@assays$integrated <- NULL
saveRDS(sub_Epithelialauc,'sub_Epithelialauc.rds')
##导入二进制regulonAUC矩阵
BINmatrix <- readRDS("int/4.1_binaryRegulonActivity.Rds")
BINmatrix <- data.frame(t(BINmatrix), check.names=F)
RegulonName_BIN <- colnames(BINmatrix)
RegulonName_BIN <- gsub(' \\(','_',RegulonName_BIN)
RegulonName_BIN <- gsub('\\)','',RegulonName_BIN)
colnames(BINmatrix) <- RegulonName_BIN
sub_Epithelialbin <- AddMetaData(sub_Epithelial, BINmatrix)
sub_Epithelialbin@assays$integrated <- NULL
saveRDS(sub_Epithelialbin, 'sub_Epithelialbin.rds')
##利用Seurat可视化AUC
dir.create('scenic_seurat')
#FeaturePlot
p1 = FeaturePlot(sub_Epithelialauc, features='CD59_extended_34g', label=T, reduction = 'umap')
p2 = FeaturePlot(sub_Epithelialbin, features='CD59_extended_34g', label=T, reduction = 'umap')
p3 = DimPlot(sub_Epithelial, reduction = 'umap', group.by = "subtype", label=T)
p1|p2|p3
# RidgePlot&VlnPlot
p1 = RidgePlot(sub_Epithelialauc, features = "CD59_extended_34g", group.by="subtype") +
theme(legend.position='none')
p2 = VlnPlot(sub_Epithelialauc, features = "CD59_extended_34g", pt.size = 0, group.by="subtype") +
theme(legend.position='none')
p1 + p2
library(pheatmap)
cellInfo <- readRDS("int/cellInfo.Rds")
subtype = subset(cellInfo,select = 'subtype')
AUCmatrix <- t(AUCmatrix)
BINmatrix <- t(BINmatrix)
#挑选部分感兴趣的regulons
my.regulons <- c('CD59_extended_34g','CEBPD_extended_14g','CHD2_extended_668g','TEAD1_extended_68g','TEAD1_61g',
'JUNB_extended_24g','CREB3L4_37g','SPDEF_56g','CREB3L4_extended_124g','IRF1_31g','IRF1_extended_40g')
myAUCmatrix <- AUCmatrix[rownames(AUCmatrix)%in%my.regulons,]
myBINmatrix <- BINmatrix[rownames(BINmatrix)%in%my.regulons,]
#使用regulon原始AUC值绘制热图
pheatmap(myAUCmatrix, show_colnames=F, annotation_col=subtype)
#使用regulon二进制AUC值绘制热图
pheatmap(myBINmatrix, show_colnames=F, annotation_col=subtype,
color = colorRampPalette(colors = c("white","black"))(100))
参考教程:
- 单细胞转录组高级分析二:转录调控网络分析(https://cloud.tencent.com/developer/article/1692240)