Last week we covered transcription factor motif finding, and I have recently been using SCENIC to identify key TFs responsible for the generation and function of tumor-associated neutrophils (TANs). In this mini-review, I want to introduce how SCENIC identifies key transcription factors, following the aertslab/SCENIC vignette vignettes/detailedStep_1_coexNetwork2modules.Rmd (rdrr.io).
SCENIC (Single-Cell rEgulatory Network Inference and Clustering) is a tool for inferring gene regulatory networks and their associated TFs from single-cell RNA-seq data. SCENIC was developed by aertslab (1) and depends on three tools, also developed by them: GENIE3, RcisTarget, and AUCell. In this review I will focus on the first two.
1. GENIE3 (GEne Network Inference with Ensemble of trees) (2)
The aim of this tool is to infer gene regulatory networks (GRNs) from an expression matrix. To do so, GENIE3 solves n regression problems: the expression pattern of each target gene is predicted from the expression patterns of all regulators. (By default, each gene in turn is treated as the target and all other genes are its candidate regulators; GENIE3 also lets you specify a subset of regulators instead of using all genes.) Each regression problem is solved with a tree-based ensemble method, either Random Forests or Extra-Trees. Finally, the function GENIE3 returns a weight matrix (weightMat <- GENIE3(exprMatr, regulators=regulators)) whose rows are regulators and whose columns are targets; each cell [i,j] is the weight of the regulation of regulator i on target j (wij).
The default method is Random Forests (3). A random forest is an ensemble of individual decision trees (the number of trees can be set with nTrees). At each node of a tree, K candidate regulators (set with K) are randomly selected from all genes other than the target, and the best split among them is chosen. Each run derives a regulatory network for a single target (4), and the process repeats until a network has been generated for every target. The output is therefore a compendium of individual regulatory networks, one per target gene (Figure 1).
Here is a simple example of how a single tree is constructed (Figure 2):
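1) At the current node, we randomly select K candidate regulators. The target's expression values across the samples at this node are denoted S.
2) We calculate the variance of S: Var(S).
3) For a candidate split (a regulator and an expression threshold), we separate S into two subsets, St and Sf (the samples for which the split test is true or false), and calculate their variances Var(St) and Var(Sf). #St and #Sf denote the fractions of S that fall into St and Sf.
4) The importance I of this split is the reduction in variance it achieves: I(N) = Var(S) − #St · Var(St) − #Sf · Var(Sf).
5) We choose the split with the maximum I(N) as the current node and split S accordingly.
6) Each newly generated subset, St and Sf, is treated as a new S, and steps 1-5 are repeated.
7) Recursively …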
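The split selection at a single node can be sketched in a few lines of Python. This is only an illustration of the variance-reduction idea, not GENIE3's actual code: the function names `split_importance` and `best_split` are hypothetical, the toy expression values are made up, the variance is the population variance, and candidate thresholds are taken as midpoints between consecutive expression levels (an assumption).

```python
from statistics import pvariance


def split_importance(s, s_true, s_false):
    """I(N) = Var(S) - #St * Var(St) - #Sf * Var(Sf), with #St, #Sf as fractions of S."""
    frac_true = len(s_true) / len(s)
    frac_false = len(s_false) / len(s)
    return pvariance(s) - frac_true * pvariance(s_true) - frac_false * pvariance(s_false)


def best_split(regulator_expr, target_expr):
    """Try every threshold on every candidate regulator and keep the split with the largest I(N)."""
    best = (None, None, 0.0)  # (regulator, threshold, importance)
    for reg, values in regulator_expr.items():
        levels = sorted(set(values))
        # Assumed threshold choice: midpoints between consecutive observed levels.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            s_true = [y for x, y in zip(values, target_expr) if x <= threshold]
            s_false = [y for x, y in zip(values, target_expr) if x > threshold]
            importance = split_importance(target_expr, s_true, s_false)
            if importance > best[2]:
                best = (reg, threshold, importance)
    return best


# Toy data (made up): six samples, one target gene, two candidate regulators.
target = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
candidates = {
    "reg1": [0.1, 0.2, 0.15, 2.0, 2.1, 1.9],  # tracks the target closely
    "reg2": [1.0, 3.0, 2.0, 1.5, 2.5, 0.5],   # unrelated to the target
}
best_reg, best_threshold, best_importance = best_split(candidates, target)
print(best_reg, best_threshold, round(best_importance, 3))
```

In this toy case the chosen split is on reg1, because it separates the low-expression samples of the target from the high-expression ones almost perfectly. A full tree would apply the same search again to each of the two resulting subsets.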
Within each tree, the importance of a regulator gene (say 'reg1') for the target is measured by its contribution to the total importance, i.e. the sum of I(N) over the nodes where reg1 is used to split.
After iterating over all trees, these importance measures are averaged. To ease interpretation, the averaged importance scores are then normalized to lie between 0 and 1. These normalized values are the weights in the final output, giving an intuitive measure of how strongly each regulator gene is linked to the target.
SCENIC only cares about the regulatory effect of TFs on other genes, so by default it uses TFs as the regulators instead of all genes. After computing the TF regulatory network, SCENIC creates TF-modules by selecting the best targets for each TF in several ways:
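As a rough illustration of this aggregation step, the sketch below averages per-tree importances and rescales them. The function name `aggregate_importances` and the toy numbers are hypothetical, and dividing by the total so that the weights for one target sum to 1 is an assumption made here for simplicity; GENIE3's exact normalization may differ.

```python
def aggregate_importances(per_tree):
    """Average per-tree importances, then rescale so the weights for this target sum to 1.

    per_tree: list of dicts {regulator: summed I(N) within one tree}.
    """
    regulators = sorted({reg for tree in per_tree for reg in tree})
    # A regulator that is never used in a tree contributes 0 for that tree.
    mean_importance = {
        reg: sum(tree.get(reg, 0.0) for tree in per_tree) / len(per_tree)
        for reg in regulators
    }
    total = sum(mean_importance.values())
    return {reg: (value / total if total > 0 else 0.0)
            for reg, value in mean_importance.items()}


# Toy example (made-up numbers): two trees built for the same target gene.
trees = [
    {"reg1": 3.0, "reg2": 1.0},
    {"reg1": 2.0, "reg3": 2.0},
]
weights = aggregate_importances(trees)
print(weights)
```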
a. Targets with weight > 0.001
b. Targets with weight > 0.005
c. Top 50 targets (targets with highest weight)
d. Targets for which the TF is within its top 5 regulators
e. Targets for which the TF is within its top 10 regulators
f. Targets for which the TF is within its top 50 regulators
At this point GENIE3's job is done: we have the best targets for each TF.
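The six selection rules can be sketched as follows. This is a minimal Python illustration, not SCENIC's implementation: the function `build_tf_modules`, the module labels, and the toy link list are all hypothetical, and the input is assumed to be a list of (TF, target, weight) links derived from the GENIE3 weight matrix.

```python
from collections import defaultdict


def build_tf_modules(links, weight_cutoffs=(0.001, 0.005),
                     top_targets=50, top_regulators=(5, 10, 50)):
    """Build candidate TF-modules from (tf, target, weight) links.

    Returns {(tf, rule_label): set of targets}; the labels are illustrative only.
    """
    modules = defaultdict(set)
    by_tf = defaultdict(list)
    by_target = defaultdict(list)
    for tf, target, weight in links:
        by_tf[tf].append((weight, target))
        by_target[target].append((weight, tf))
        # Rules a and b: keep targets whose weight exceeds a fixed cutoff.
        for cutoff in weight_cutoffs:
            if weight > cutoff:
                modules[(tf, f"w>{cutoff}")].add(target)
    # Rule c: the highest-weight targets of each TF.
    for tf, pairs in by_tf.items():
        for _, target in sorted(pairs, reverse=True)[:top_targets]:
            modules[(tf, f"top{top_targets}")].add(target)
    # Rules d to f: targets for which the TF is among the top-N regulators.
    for target, pairs in by_target.items():
        ranked = sorted(pairs, reverse=True)
        for n in top_regulators:
            for _, tf in ranked[:n]:
                modules[(tf, f"top{n}perTarget")].add(target)
    return dict(modules)


# Toy link list (made-up weights).
links = [
    ("TF1", "g1", 0.010),
    ("TF1", "g2", 0.003),
    ("TF1", "g3", 0.0005),
    ("TF2", "g1", 0.002),
]
modules = build_tf_modules(links)
print(sorted(modules[("TF1", "w>0.001")]))
```

Each TF therefore ends up with up to six candidate gene lists, which are the inputs to the motif enrichment step described next.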
2. RcisTarget
RcisTarget identifies transcription factor (TF) binding motifs that are over-represented in a given gene list. In the context of SCENIC, we have already obtained TF-modules for each transcription factor. Each module is treated as a gene list and tested to determine whether its enriched TF binding motifs match the TF that GENIE3 assigned to the module. If they match, the module is considered a highly reliable TF regulatory network, and it is then used to estimate the TF's activity in each single cell.
The calculations in RcisTarget are based primarily on a motif-ranking resource file such as 'hg19-500bp-upstream-7species.mc9nr.feather'. You can think of this file as a matrix whose rows are motifs and whose columns are genes. Each element [i,j] is the rank of gene j for motif i, based on how strongly motif i scores in the 500-bp region upstream of gene j. Essentially, this value reflects the regulatory intensity of motif i on gene j. The matrix looks like this:
Also, we need the association between motifs and TFs, as below:
The calculation is as follows:
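For every motif, we arrange the genes (all of them, or only the top-ranked ones; by default the top 5000) on the x-axis in ascending order of their rank for that motif;
Then we draw a step curve that rises each time a gene from our gene list is encountered along this ranking, and calculate the area under it. This area, expressed as a proportion of the enclosing rectangle (x from 0 to the number of genes considered, y from 0 to the maximum height of the curve), is the Enrichment Score (ES);
After computing the ES for all motifs, we calculate a Normalized Enrichment Score (NES) for each motif from the mean and standard deviation of the ES over all motifs: NES = (ES − mean(ES)) / sd(ES);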
In the end, motifs which have a NES greater than 3 are considered significantly enriched.
Based on the association between motifs and transcription factors (TFs), we can then infer whether the gene list is regulated by a given TF.
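The enrichment calculation can be sketched as below. This is a simplified illustration rather than RcisTarget's code: the ES is taken as the area under a step curve that rises by one each time a gene from the gene list appears in a motif's ranking, divided by the enclosing rectangle, and the NES is a z-score of the ES over all motifs using the sample standard deviation (an assumption). The function names `recovery_es` and `normalized_enrichment`, the motif names, and the toy rankings are hypothetical.

```python
from statistics import mean, stdev


def recovery_es(ranked_genes, gene_set, max_rank):
    """Area under the step recovery curve over the top max_rank genes, as a fraction of the rectangle."""
    top = ranked_genes[:max_rank]
    hits = 0
    area = 0
    for gene in top:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the step curve at this rank position
    return area / (len(top) * len(gene_set))


def normalized_enrichment(es_by_motif):
    """NES = (ES - mean(ES)) / sd(ES), computed over all motifs."""
    values = list(es_by_motif.values())
    mu, sd = mean(values), stdev(values)
    return {motif: (es - mu) / sd for motif, es in es_by_motif.items()}


# Toy data (made up): 20 genes, a 2-gene list, and 20 motif rankings.
genes = [f"g{i}" for i in range(20)]
gene_list = {"g0", "g1"}
rankings = {"motif_hit": genes}  # ranks the gene-list genes first
for k in range(19):
    rankings[f"motif_bg{k}"] = genes[::-1]  # ranks the gene-list genes last

es = {motif: recovery_es(order, gene_list, max_rank=5) for motif, order in rankings.items()}
nes = normalized_enrichment(es)
enriched = [motif for motif, score in nes.items() if score > 3]
print(enriched)
```

Only the motif that places the gene-list genes at the top of its ranking passes the NES > 3 threshold in this toy setting; the TFs annotated to such a motif would then be compared with the TF of the module.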
Therefore, we can perform this enrichment analysis for every TF-module of each TF, and confirm its regulatory effect on this gene set. Once we identify the highly credible TF-modules, we can use tools like AUCell to evaluate their activity in each individual cell.
References
- Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, et al. SCENIC: single-cell regulatory network inference and clustering. Nature methods. 2017;14(11):1083-6.
- Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLOS ONE. 2010;5(9):e12776.
- Geurts P, Irrthum A, Wehenkel L. Supervised learning with decision tree-based methods in computational and systems biology. Molecular BioSystems. 2009;5(12):1593-605.
- Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. 1984.