Differential gene expression (DGE) analysis
Learning Objectives
- Explain the experiment and its objectives
- Describe how to set up an RNA-seq project in R
- Describe the RNA-seq and the differential gene expression analysis workflow
- Explain why negative binomial distribution is used to model RNA-seq count data
1.Setting up
Loading libraries
For this analysis we will be using several R packages, some which have been installed from CRAN and others from Bioconductor. To use these packages (and the functions contained within them), we need to load the libraries. Add the following to your script and don’t forget to comment liberally!
## Setup
### Bioconductor and CRAN libraries used
library(DESeq2)
library(tidyverse)
library(RColorBrewer)
library(pheatmap)
library(DEGreport)
library(tximport)
library(ggplot2)
library(ggrepel)
Loading data
We use the data from Salmon
The pseudocounts generated by Salmon are represented as normalized TPM (transcripts per million) counts and map to transcripts. These need to be converted into non-normalized count estimates for performing DESeq2 analysis. To use DESeq2 we also need to collapse our abundance estimates from the transcript level to the gene-level. We will be using the R Bioconductor package tximport to do all of the above and get set up for DESeq2.
The first thing we need to do is create a variable that contains the paths to each of our quant.sf files. Then we will add names to our quant files which will allow us to easily discriminate between samples in the final output matrix.
## List all directories containing data
samples <- list.files(path = "./data", full.names = T, pattern="salmon$")
## Obtain a vector of all filenames including the path
files <- file.path(samples, "quant.sf")
## Since all quant files have the same name it is useful to have names for each element
names(files) <- str_replace(samples, "./data/", "") %>%
str_replace(".salmon", "")
Our Salmon index was generated with transcript sequences listed by Ensembl IDs, but tximport needs to know which genes these transcripts came from. We will use the annotation table that we downloaded to extract transcript to gene information.
# Load the annotation table for GrCh38
tx2gene <- read.delim("tx2gene_grch38_ens94.txt")
# Take a look at it
tx2gene %>% View()
tx2gene is a three-column dataframe linking transcript ID (column 1) to gene ID (column 2) to gene symbol (column 3). We will take the first two columns as input to tximport. The column names are not relevant, but the column order is (i.e trasncript ID must be first).
Now we are ready to run tximport. Note that although there is a column in our quant.sf files that corresponds to the estimated count value for each transcript, those valuse are correlated by effective length. What we want to do is use the countsFromAbundance=“lengthScaledTPM” argument. This will use the TPM column, and compute quantities that are on the same scale as original counts, except no longer correlated with transcript length across samples.
# Run tximport
txi <- tximport(files, type="salmon", tx2gene=tx2gene[,c("tx_id", "ensgene")], countsFromAbundance="lengthScaledTPM")
Viewing data
The txi object is a simple list containing matrices of the abundance, counts, length. Another list element ‘countsFromAbundance’ carries through the character argument used in the tximport call. The length matrix contains the average transcript length for each gene which can be used as an offset for gene-level analysis.
attributes(txi)
$names
[1] "abundance" "counts" "length"
[4] "countsFromAbundance"
We will be using the txi object as is, for input into DESeq2 but will save it until the next lesson. For now let’s take a look at the count matrix. You will notice that there are decimal values, so let’s round to the nearest whole number and convert it into a dataframe. We wil save it to a variable called data that we can play with.
# Look at the counts
txi$counts %>% View()
# Write the counts to an object
data <- txi$counts %>%
round() %>%
data.frame()
Creating metadata
Of great importance is keeping track of the information about our data. At minimum, we need to atleast have a file which maps our samples to the corresponding sample groups that we are investigating. We will use the columns headers from the counts matrix as the row names of our metadata file and have single column to identify each sample as “MOV10_overexpression”, “MOV10_knockdown”, or “control”.
## Create a sampletable/metadata
sampletype <- factor(c(rep("control",3), rep("MOV10_knockdown", 2), rep("MOV10_overexpression", 3)))
meta <- data.frame(sampletype, row.names = colnames(txi$counts))
2.DGE analysis workflow
Count normalization of Mov10 dataset using DESeq2
Now that we know the theory of count normalization, we will normalize the counts for the Mov10 dataset using DESeq2. This requires a few steps:
- Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.
- Create a DESeqDataSet object
- Generate the normalized counts
1. Match the metadata and counts data
### Check that sample names match in both files
all(colnames(txi$counts) %in% rownames(meta))
all(colnames(txi$counts) == rownames(meta))
2.Create DESEq2 object
Bioconductor software packages often define and use a custom class within R for storing data (input data, intermediate data and also results). These custom data structures are similar to lists in that they can contain multiple different data types/structures within them. But, unlike lists they have pre-specified data slots, which hold specific types/classes of data. The data stored in these pre-specified slots can be accessed by using specific package-defined functions.
Let’s start by creating the DESeqDataSet object and then we can talk a bit more about what is stored inside it. To create the object we will need the count matrix and the metadata table as input. We will also need to specify a design formula. The design formula specifies the column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is ~sampletype. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.
Our count matrix input is stored inside the txi list object, and so we pass that in using the DESeqDataSetFromTximport() function which will extract the counts component and round the values to the nearest whole number.
## Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)
3. Generate the Mov10 normalized counts
To perform the median of ratios method of normalization, DESeq2 has a single estimateSizeFactors() function that will generate size factors for us. We will use the function in the example below, but in a typical RNA-seq analysis this step is automatically performed by the DESeq() function, which we will see later.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
Now, to retrieve the normalized counts matrix from dds, we use the counts() function and add the argument normalized=TRUE.
normalized_counts <- counts(dds, normalized=TRUE)
We can save this normalized data matrix to file for later use:
write.table(normalized_counts, file="data/normalized_counts.txt", sep="\t", quote=F, col.names=NA)
NOTE: ==DESeq2 doesn’t actually use normalized counts, rather it uses the raw counts and models the normalization inside the Generalized Linear Model (GLM). These normalized counts will be useful for downstream visualization of results, but cannot be used as input to DESeq2 or any other tools that peform differential expression analysis which use the negative binomial model==.
Quality Control
The next step in the DESeq2 workflow is QC, which includes sample-level and gene-level steps to perform QC checks on the count data to help us ensure that the samples/replicates look good.
1.Mov10 quality assessment and exploratory analysis using DESeq2
Transform normalized counts using the rlog transformation
To improve the distances/clustering for the PCA and heirarchical clustering visualization methods, we need to moderate the variance across the mean by applying the rlog transformation to the normalized counts.
### Transform counts for data visualization
rld <- rlog(dds, blind=TRUE)
The blind=TRUE argument results in a transformation unbiased to sample condition information. When performing quality assessment, it is important to include this option. The DESeq2 vignette has more details.
The rlog function returns a DESeqTransform object, another type of DESeq-specific object. The reason you don’t just get a matrix of transformed values is because all of the parameters (i.e. size factors) that went into computing the rlog transform are stored in that object. We use this object to plot the PCA and heirarchical clustering figures for quality assessment.
NOTE:The rlog() funtion can be a bit slow when you have e.g. > 20 samples. In these situations the vst() function is much faster and performs a similar transformation appropriate for use with plotPCA(). It’s typically just a few seconds with vst() due to optimizations and the nature of the transformation.
2.Principal components analysis (PCA)
DESeq2 has a built-in function for plotting PCA plots, that uses ggplot2 under the hood. This is great because it saves us having to type out lines of code and having to fiddle with the different ggplot2 layers. In addition, it takes the rlog object as an input directly, hence saving us the trouble of extracting the relevant information from it.
The function plotPCA() requires two arguments as input: an rlog object and the intgroup (the column in our metadata that we are interested in).
### Plot PCA
plotPCA(rld, intgroup="sampletype")
==NOTE: The plotPCA() function will only return the values for PC1 and PC2. If you would like to explore the additional PCs in your data or if you would like to identify genes that contribute most to the PCs, you can use the prcomp() function. For example, to plot any of the PCs we could run the following code:==
# Input is a matrix of log transformed values
rld <- rlog(dds, blind=T)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))
# Create data frame with metadata and PC3 and PC4 values for input to ggplot
df <- cbind(meta, pca$x)
ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))
3.Hierarchical Clustering
Since there is no built-in function for heatmaps in DESeq2 we will be using the pheatmap() function from the pheatmap package. This function requires a matrix/dataframe of numeric values as input, and so the first thing we need to is retrieve that information from the rld object:
### Extract the rlog matrix from the object
rld_mat <- assay(rld) ## assay() is function from the "SummarizedExperiment" package that was loaded when you loaded DESeq2
Then we need to compute the pairwise correlation values for samples. We can do this using the cor() function:
### Compute pairwise correlation values
rld_cor <- cor(rld_mat) ## cor() is a base R function
head(rld_cor) ## check the output of cor(), make note of the rownames and colnames
And now to plot the correlation values as a heatmap:
### Load pheatmap package
library(pheatmap)
### Plot heatmap
pheatmap(rld_cor, annotation = meta)
3.Differential expression analysis with DESeq2
The final step in the differential expression analysis workflow is fitting the raw counts to the NB model and performing the statistical test for differentially expressed genes. In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.
Running DESeq2
Prior to performing the differential expression analysis, it is a good idea to know what sources of variation are present in your data, either by exploration during the QC and/or prior knowledge. Once you know the major sources of variation, you can remove them prior to analysis or control for them in the statistical model by including them in your design formula.
If you want to examine the expression differences between treatments, and you know that major sources of variation include sex and age, then your design formula would be:
design <- ~ sex + age + treatment
Complex designs
DESeq2 also allows for the analysis of complex designs. You can explore interactions or ‘the difference of differences’ by specifying for it in the design formula. For example, if you wanted to explore the effect of sex on the treatment effect, you could specify for it in the design formula as follows:
design <- ~ sex + age + treatment + sex:treatment
MOV10 DE analysis
To get our differential expression results from our raw count data, we only need to run 2 lines of code!
First we create a DESeqDataSet as we did in the ‘Count normalization’ lesson and specify the txi object which contains our raw counts, the metadata variable, and provide our design formula:
## Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)
Then, to run the actual differential expression analysis, we use a single call to the function DESeq().
## Run analysis
dds <- DESeq(dds)
MOV10 Differential Expression Analysis: Control versus Overexpression
We have three sample classes so we can make three possible pairwise comparisons:
- Control vs. Mov10 overexpression
- Control vs. Mov10 knockdown
- Mov10 knockdown vs. Mov10 overexpression
We are really only interested in #1 and #2 from above. Using the design formula we provided ~ sampletype, indicating that this is our main factor of interest.
Creating contrasts
To indicate to DESeq2 the two groups we want to compare, we can use contrasts. Contrasts are then provided to DESeq2 to perform differential expression testing using the Wald test. Contrasts can be provided to DESeq2 a couple of different ways:
Do nothing. Automatically DESeq2 will use the base factor level of the condition of interest as the base for statistical testing. The base level is chosen based on alphabetical order of the levels.
In the results() function you can specify the comparison of interest, and the levels to compare. The level given last is the base level for the comparison. The syntax is given below:
# DO NOT RUN!
contrast <- c("condition", "level_to_compare", "base_level")
results(dds, contrast = contrast, alpha = alpha_threshold)
==NOTE: The Wald test can also be used with continuous variables. If the variable of interest provided in the design formula is continuous-valued, then the reported log2 fold change is per unit of change of that variable.==
Building the results table
To build our results table we will use the results() function. To tell DESeq2 which groups we wish to compare, we supply the contrasts we would like to make using thecontrast argument
## Define contrasts, extract results table, and shrink the log2 fold changes
contrast_oe <- c("sampletype", "MOV10_overexpression", "control")
res_tableOE <- results(dds, contrast=contrast_oe, alpha = 0.05)
Results exploration
The results table looks very much like a dataframe and in many ways it can be treated like one (i.e when accessing/subsetting data). However, it is important to recognize that it is actually stored in a DESeqResults object. When we start visualizing our data, this information will be helpful.
class(res_tableOE)
mcols(res_tableOE, use.names=T)
res_tableOE %>% data.frame() %>% View()
==NOTE: on p-values set to NA==
- If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA.
- If a row contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA. These outlier counts are detected by Cook’s distance.
- If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p-value will be set to NA.
Shrunken log2 foldchanges (LFC)
To generate more accurate log2 foldchange estimates, DESeq2 allows for the shrinkage of the LFC estimates toward zero when the information for a gene is low, which could include:
- Low counts
- High dispersion values
To generate the shrunken log2 fold change estimates, you have to run an additional step on your results object (that we will create below) with the function lfcShrink().
## Save the unshrunken results to compare
res_tableOE_unshrunken <- res_tableOE
# Apply fold change shrinkage
res_tableOE <- lfcShrink(dds, contrast=contrast_oe, res=res_tableOE)
==NOTE: Shrinking the log2 fold changes will not change the total number of genes that are identified as significantly differentially expressed. The shrinkage of fold change is to help with downstream assessment of results. For example, if you wanted to subset your significant genes based on fold change for further evaluation, you may want to use shruken values. Additionally, for functional analysis tools such as GSEA which require fold change values as input you would want to provide shrunken values.==
MA Plot
A plot that can be useful to exploring our results is the MA plot. The MA plot shows the mean of the normalized counts versus the log2 foldchanges for all genes tested. The genes that are significantly DE are colored to be easily identified. This is also a great way to illustrate the effect of LFC shrinkage. The DESeq2 package offers a simple function to generate an MA plot.
Let’s start with the unshrunken results:
plotMA(res_tableOE_unshrunken, ylim=c(-2,2))
MOV10 Differential Expression Analysis: Control versus Knockdown
Now that we have results for the overexpression results, let’s do the same for the Control vs. Knockdown samples. Use contrasts in the results() to extract a results table and store that to a variable called** res_tableKD**.
## Define contrasts, extract results table and shrink log2 fold changes
contrast_kd <- c("sampletype", "MOV10_knockdown", "control")
res_tableKD <- results(dds, contrast=contrast_kd, alpha = 0.05)
res_tableKD <- lfcShrink(dds, contrast=contrast_kd, res=res_tableKD)
Summarizing results
Summary of differential expression analysis workflow
To summarize the results table, a handy function in DESeq2 is summary(). Confusingly it has the same name as the function used to inspect data frames. This function when called with a DESeq results table as input, will summarize the results using the alpha threshold: FDR < 0.05 (padj/FDR is used even though the output says p-value < 0.05). Let’s start with the OE vs control results:
## Summarize results
summary(res_tableOE, alpha = 0.05)
Extracting significant differentially expressed genes
Let’s first create variables that contain our threshold criteria
### Set thresholds
padj.cutoff <- 0.05
We can easily subset the results table to only include those that are significant using the filter() function, but first we will convert the results table into a tibble:
res_tableOE_tb <- res_tableOE %>%
data.frame() %>%
rownames_to_column(var="gene") %>%
as_tibble()
Now we can subset that table to only keep the significant genes using our pre-defined thresholds:
sigOE <- res_tableOE_tb %>%
filter(padj < padj.cutoff)
Adding a fold change threshold:
With large significant gene lists it can be hard to extract meaningful biological relevance. To help increase stringency, one can also add a fold change threshold.
For e.g., we can create a new threshold lfc.cutoff and set it to 0.58 (remember that we are working with log2 fold changes so this translates to an actual fold change of 1.5).
lfc.cutoff <- 0.58
sigOE <- res_tableOE_tb %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
==An alternative approach to add the fold change threshold:==
==The results() function has an option to add a fold change threshold using the lfcThrehsold argument. This method is more statistically motivated, and is recommended when you want a more confident set of genes based on a certain fold-change. It actually performs a statistical test against the desired threshold, by performing a two-tailed test for log2 fold changes greater than the absolute value specified. The user can change the alternative hypothesis using altHypothesis and perform two one-tailed tests as well. This is a more conservative approach, so expect to retrieve a much smaller set of genes!==
results(dds, contrast = contrast_oe, alpha = 0.05, lfcThreshold = 0.58)
4.Visualizing the results
We will be working with three different data objects we have already created
- Metadata for our samples (a dataframe): meta
- Normalized expression data for every gene in each of our samples (a matrix): normalized_counts
- Tibble versions of the DESeq2 results we generated in the last lesson: res_tableOE_tb and res_tableKD_tb
First, let’s create a metadata tibble from the data frame (don’t lose the row names!)
mov10_meta <- meta %>%
rownames_to_column(var="samplename") %>%
as_tibble()
Next, let’s bring in a column with gene symbols to the normalized_counts object, so we can use them to label our plots. Ensembl IDs are great for many things, but the gene symbols are much more recognizable to us, as biologists.
# DESeq2 creates a matrix when you use the counts() function
## First convert normalized_counts to a data frame and transfer the row names to a new column called "gene"
normalized_counts <- counts(dds, normalized=T) %>%
data.frame() %>%
rownames_to_column(var="gene")
# Next, merge together (ensembl IDs) the normalized counts data frame with a subset of the annotations in the tx2gene data frame (only the columns for ensembl gene IDs and gene symbols)
grch38annot <- tx2gene %>%
dplyr::select(ensgene, symbol) %>%
dplyr::distinct()
## This will bring in a column of gene symbols
normalized_counts <- merge(normalized_counts, grch38annot, by.x="gene", by.y="ensgene")
# Now create a tibble for the normalized counts
normalized_counts <- normalized_counts %>%
as_tibble()
normalized_counts
1).Plotting signicant DE genes
Using DESeq2 plotCounts() to plot expression of a single gene
# Find the Ensembl ID of MOV10
grch38annot[grch38annot$symbol == "MOV10", "ensgene"]
# Plot expression for single gene
plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype")
Using ggplot2 to plot expression of a single gene
# Save plotcounts to a data frame object
d <- plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype", returnData=TRUE)
# What is the data output of plotCounts()?
d %>% View()
# Plot the MOV10 normalized counts, using the samplenames (rownames(d) as labels)
ggplot(d, aes(x = sampletype, y = count, color = sampletype)) +
geom_point(position=position_jitter(w = 0.1,h = 0)) +
geom_text_repel(aes(label = rownames(d))) +
theme_bw() +
ggtitle("MOV10") +
theme(plot.title = element_text(hjust = 0.5))
Heatmap
In addition to plotting subsets, we could also extract the normalized values of all the significant genes and plot a heatmap of their expression using pheatmap().
### Extract normalized expression for significant genes from the OE and control samples (2:4 and 7:9)
norm_OEsig <- normalized_counts[,c(1:4,7:9)] %>%
filter(gene %in% sigOE$gene)
Now let’s draw the heatmap using pheatmap:
### Set a color palette
heat_colors <- brewer.pal(6, "YlOrRd")
### Run pheatmap using the metadata data frame for the annotation
pheatmap(norm_OEsig[2:7],
color = heat_colors,
cluster_rows = T,
show_rownames = F,
annotation = meta,
border_color = NA,
fontsize = 10,
scale = "row",
fontsize_row = 10,
height = 20)
NOTE: There are several additional arguments we have included in the function for aesthetics. ==One important one is scale="row", in which Z-scores are plotted, rather than the actual normalized count value.==
Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed after the clustering, so that it only affects the graphical aesthetics and the color visualization is improved.
Volcano plot
The above plot would be great to look at the expression levels of a good number of genes, but for more of a global view there are other plots we can draw. A commonly used one is a volcano plot; in which you have the log transformed adjusted p-values plotted on the y-axis and log2 fold change values on the x-axis.
To generate a volcano plot, we first need to have a column in our results data indicating whether or not the gene is considered differentially expressed based on p-adjusted values and we will include a log2fold change here.
## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 in either direction
res_tableOE_tb <- res_tableOE_tb %>%
mutate(threshold_OE = padj < 0.05 & abs(log2FoldChange) >= 0.58)
Now we can start plotting. The geom_point object is most applicable, as this is essentially a scatter plot:
## Volcano plot
ggplot(res_tableOE_tb) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = threshold_OE)) +
ggtitle("Mov10 overexpression") +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
#scale_y_continuous(limits = c(0,50)) +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
This is a great way to get an overall picture of what is going on, but what if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the Volcano plot using geom_text_repel().
First, we need to order the res_tableOE tibble by padj, and add an additional column to it, to include on those gene names we want to use to label the plot.
## Add all the gene symbols as a column from the grch38 table using bind_cols()
res_tableOE_tb <- bind_cols(res_tableOE_tb, symbol=grch38annot$symbol[match(res_tableOE_tb$gene, grch38annot$ensgene)])
## Create an empty column to indicate which genes to label
res_tableOE_tb <- res_tableOE_tb %>% mutate(genelabels = "")
## Sort by padj values
res_tableOE_tb <- res_tableOE_tb %>% arrange(padj)
## Populate the genelabels column with contents of the gene symbols column for the first 10 rows, i.e. the top 10 most significantly expressed genes
res_tableOE_tb$genelabels[1:10] <- as.character(res_tableOE_tb$symbol[1:10])
View(res_tableOE_tb)
Next, we plot it as before with an additional layer for geom_text_repel() wherein we can specify the column of gene labels we just created.
ggplot(res_tableOE_tb, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(aes(colour = threshold_OE)) +
geom_text_repel(aes(label = genelabels)) +
ggtitle("Mov10 overexpression") +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
NOTE: If using the DESeq2 tool for differential expression analysis, the package ‘==DEGreport==’ can use the DESeq2 results output to make the top20 genes and the volcano plots generated above by writing a few lines of simple code. While you can customize the plots above, you may be interested in using the easier code. Below are examples of the code to create these plots:
DEGreport::degPlot(dds = dds, res = res, n = 20, xs = "type", group = "condition") # dds object is output from DESeq2
DEGreport::degVolcano(
data.frame(res[,c("log2FoldChange","padj")]), # table - 2 columns
plot_text = data.frame(res[1:10,c("log2FoldChange","padj","id")])) # table to add names
# Available in the newer version for R 3.4
DEGreport::degPlotWide(dds = dds, genes = row.names(res)[1:5], group = "condition")
2)Hypothesis testing: Likelihood ratio test (LRT)
DESeq2 also offers the Likelihood Ratio Test as an alternative when evaluating expression change across more than two levels. This type of test can be especially useful in analyzing time course experiments.
To use the LRT, we use the DESeq() function but this time adding two arguments:
- specifying that we want to use the LRT test
- the ‘reduced’ model
library(DESeq2)
library(DEGreport)
# The full model was specified previously with the `design = ~ sampletype`:
# dds <- DESeqDataSetFromTximport(txi, colData = meta, ~ sampletype)
# Likelihood ratio test
dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)
Since our ‘full’ model only has one factor (sampletype), the ‘reduced’ model is just the intercept (~ 1).
Generally, this test will result in a larger number of genes than the individual pair-wise comparisons. While the LRT is a test of significance for differences of any level of the factor, one should not expect it to be exactly equal to the union of sets of genes using Wald tests (although we do expect a majority overlap).
Let’s take a look at the results table:
# Extract results
res_LRT <- results(dds_lrt)
You will find that similar columns are reported for the LRT test. One thing to note is, even though there are fold changes present they are not directly associated with the actual hypothesis test. Thus, when filtering significant genes from the LRT we use only the FDR as our threshold. How many genes are significant at padj < 0.05?
# Create a tibble for LRT results
res_LRT_tb <- res_LRT %>%
data.frame() %>%
rownames_to_column(var="gene") %>%
as_tibble()
# Subset to return genes with padj < 0.05
sigLRT_genes <- res_LRT_tb %>%
filter(padj < padj.cutoff)
# Get number of significant genes
nrow(sigLRT_genes)
# Compare to numbers we had from Wald test
nrow(sigOE)
nrow(sigKD)
Identifying gene clusters exhibiting particular patterns across samples
Often we are interested in genes that have ==particular patterns across the sample groups (levels) of our condition==. For example, with the MOV10 dataset, we may be interested in genes that exhibit the lowest expression for the Mov10_KD and highest expression for the Mov10_OE sample groups (i.e. KD < CTL < OE). ==To identify genes associated with these patterns we can use a clustering tool, degPatterns from the ‘DEGreport’ package, that groups the genes based on their changes in expression across sample groups.==
First we will subset our rlog transformed normalized counts to contain only the differentially expressed genes (padj < 0.05).
# Subset results for faster cluster finding (for classroom demo purposes)
clustering_sig_genes <- sigLRT_genes %>%
arrange(padj) %>%
head(n=1000)
# Obtain rlog values for those significant genes
cluster_rlog <- rld_mat[clustering_sig_genes$gene, ]
==Then we can use the degPatterns function from the ‘DEGreport’ package to determine sets of genes that exhibit similar expression patterns across sample groups.== The degPatterns tool uses a hierarchical clustering approach based on pair-wise correlations, then cuts the hierarchical tree to generate groups of genes with similar expression profiles. The tool cuts the tree in a way to optimize the diversity of the clusters, such that the variability inter-cluster > the variability intra-cluster.
# Use the `degPatterns` function from the 'DEGreport' package to show gene clusters across sample groups
clusters <- degPatterns(cluster_rlog, metadata = meta, time = "sampletype", col=NULL)
Let’s explore the output:
# What type of data structure is the `clusters` output?
class(clusters)
# Let's see what is stored in the `df` component
head(clusters$df)
Let’s explore the set of genes in Group 1 in more detail:
# Extract the Group 1 genes
cluster_groups <- clusters$df
group1 <- clusters$df %>%
filter(cluster == 1)
After extracting a group of genes, we can perform ==functional analysis to explore associated functions==. We can repeat this extraction and functional analysis for any of the groups of interest.
Time course analyses with LRT
5.Genomic annotations
The analysis of next-generation sequencing results requires associating genes, transcripts, proteins, etc. with functional or regulatory information. To perform functional analysis on gene lists, we often need to obtain gene identifiers that are compatible with the tools we wish to use and this is not always trivial. Here, we discuss ways in which you can obtain gene annotation information and some of the advantages and disadvantages of each method.
Databases
General databases
Offer comprehensive information on genome features, feature coordinates, homology, variant information, phenotypes, protein domain/family information, associated biological processes/pathways, associated microRNAs, etc.:
- Ensembl (use Ensembl gene IDs)
- NCBI (use Entrez gene IDs)
- UCSC
- EMBL-EBI
Annotation-specific databases
Provide annotations related to a specific topic:
- Gene Ontology (GO): database of gene ontology biological processes, cellular components and molecular functions - based on Ensembl or Entrez gene IDs or official gene symbols
- KEGG: database of biological pathways - based on Entrez gene IDs
- MSigDB: database of gene sets
- Reactome: database of biological pathways
- Human Phenotype Ontology: database of genes associated with human disease
- CORUM: database of protein complexes for human, mouse, rat
…
Before you begin your search through any of these databases, you should know which build of the genome was used to generate your gene list and make sure you use the same build for the annotations during functional analysis. When a new genome build is acquired, the names and/or coordinate location of genomic features (gene, transcript, exon, etc.) may change. Therefore, the annotations regarding genome features (gene, transcript, exon, etc.) is genome-build specific and we need to make sure that our annotations are obtained from the appropriate resource.
For example, if we used the GRCh38 build of the human genome to quantify gene expression used for differential expression analysis, then we should use the same GRCh38 build of the genome to convert between gene IDs and to identify annotations for each of the genes.
Tools for accessing databases
Annotation tools: for accessing/querying annotations from a specific databases
Interface tools: for accessing/querying annotations from multiple different annotation sources
- AnnotationDbi: queries the OrgDb, TxDb, Go.db, EnsDb, and BioMart annotations.
- AnnotationHub: queries large collection of whole genome resources, including ENSEMBL, UCSC, ENCODE, Broad Institute, KEGG, NIH Pathway Interaction Database, etc.
==NOTE: These are both packages that can be used to create the tx2gene files we had you download at the beginning of this workshop.==
AnnotationHub
AnnotationHub is a wonderful resource for accessing genomic data or querying large collection of whole genome resources, including ENSEMBL, UCSC, ENCODE, Broad Institute, KEGG, NIH Pathway Interaction Database, etc. All of this information is stored and easily accessible by directly connecting to the database.
To get started with AnnotationHub, we first load the library and connect to the database:
# Load libraries
library(AnnotationHub)
library(ensembldb)
# Connect to AnnotationHub
ah <- AnnotationHub()
What is a cache?
A cache is used in R to store data or a copy of the data so that future requests can be served faster without having to re-run a lengthy computation.
The AnnotationHub() command creates a client that manages a local cache of the database, helping with quick and reproducible access. When encountering question AnnotationHub does not exist, create directory?, you can anwser either yes (create a permanant location to store cache) or no (create a temporary location to store cache). hubCache(ah) gets the file system location of the local AnnotationHub cache. hubUrl(ah) gets the URL for the online hub.
6.Functional analysis
The output of RNA-seq differential expression analysis is a list of significant differentially expressed genes (DEGs). To gain greater biological insight on the differentially expressed genes there are various analyses that can be done
- determine whether there is enrichment of known biological functions, interactions, or pathways
- identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
- use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data
Generally for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring, and pathway topology
clusterProfiler
We will be using clusterProfiler to perform over-representation analysis on GO terms associated with our list of significant genes. The tool takes as input a significant gene list and a background gene list and performs statistical enrichment analysis using hypergeometric testing. The basic arguments allow the user to select the appropriate organism and GO ontology (BP, CC, MF) to test.
Running clusterProfiler
To run clusterProfiler GO over-representation analysis, we will change our gene names into Ensembl IDs, since the tool works a bit easier with the Ensembl IDs.
Then load the following libraries:
# Load libraries
library(DOSE)
library(pathview)
library(clusterProfiler)
For the different steps in the functional analysis, we require Ensembl and Entrez IDs. We will use the gene annotations that we generated previously to merge with our differential expression results.
## Merge the AnnotationHub dataframe with the results
res_ids <- inner_join(res_tableOE_tb, annotations_ahb, by=c("gene"="gene_id"))
To perform the over-representation analysis, we need a list of background genes and a list of significant genes. For our background dataset we will use all genes tested for differential expression (all genes in our results table). For our significant gene list we will use genes with p-adjusted values less than 0.05 (we could include a fold change threshold too if we have many DE genes).
## Create background dataset for hypergeometric testing using all genes tested for significance in the results
allOE_genes <- as.character(res_ids$gene)
## Extract significant results
sigOE <- dplyr::filter(res_ids, padj < 0.05)
sigOE_genes <- as.character(sigOE$gene)
Now we can perform the GO enrichment analysis and save the results:
## Run GO enrichment analysis
ego <- enrichGO(gene = sigOE_genes,
universe = allOE_genes,
keyType = "ENSEMBL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE)
NOTE: The different organisms with annotation databases available to use with for the OrgDb argument can be found here.
Also, the keyType argument may be coded as keytype in different versions of clusterProfiler.
Finally, the ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three.
## Output results from GO analysis to a table
cluster_summary <- data.frame(ego)
write.csv(cluster_summary, "results/clusterProfiler_Mov10oe.csv")
NOTE: Instead of saving just the results summary from the ego object, it might also be beneficial to save the object itself. The save() function enables you to save it as a .rda file, e.g. save(ego, file="results/ego.rda"). The complementary function to save() is the function load(), e.g. ego <- load(file="results/ego.rda").
This is a useful set of functions to know, since it enables one to preserve analyses at specific stages and reload them when needed. More information about these functions can be found here & here.
Visualizing clusterProfiler results
clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot, enrichment plot, and the category netplot.
The dotplot shows the number of genes associated with the first 50 terms (size) and the p-adjusted values for these terms (color). This plot displays the top 50 GO terms by gene ratio (# genes related to GO term / total number of sig genes), not p-adjusted value.
## Dotplot
dotplot(ego, showCategory=50)
Orientation: to Landscape
PDF size to 8 x 14 to give a figure of appropriate size for the text labels
The next plot is the enrichment GO plot, which shows the relationship between the top 50 most significantly enriched GO terms (padj.), by grouping similar terms together. The color represents the p-values relative to the other displayed terms (brighter red is more significant) and the size of the terms represents the number of genes that are significant from our list.
## Enrichmap clusters the 50 most significant (by padj) GO terms to visualize relationships between terms
emapplot(ego, showCategory = 50)
To save the figure, click on the Export button in the RStudio Plots tab and Save as PDF.... In the pop-up window, change the PDF size to 12 x 14 to give a figure of appropriate size for the text labels.
Finally, the category netplot shows the relationships between the genes associated with the top five most significant GO terms and the fold changes of the significant genes associated with these terms (color). The size of the GO terms reflects the pvalues of the terms, with the more significant terms being larger. This plot is particularly useful for hypothesis generation in identifying genes that may be important to several of the most affected processes.
## To color genes by log2 fold changes, we need to extract the log2 fold changes from our results table creating a named vector
OE_foldchanges <- sigOE$log2FoldChange
names(OE_foldchanges) <- sigOE$gene
## Cnetplot details the genes associated with one or more terms - by default gives the top 5 significant terms (by padj)
cnetplot(ego,
categorySize="pvalue",
showCategory = 5,
foldChange=OE_foldchanges,
vertex.label.font=6)
## If some of the high fold changes are getting drowned out due to a large range, you could set a maximum fold change value
OE_foldchanges <- ifelse(OE_foldchanges > 2, 2, OE_foldchanges)
OE_foldchanges <- ifelse(OE_foldchanges < -2, -2, OE_foldchanges)
cnetplot(ego,
categorySize="pvalue",
showCategory = 5,
foldChange=OE_foldchanges,
vertex.label.font=6)
Again, to save the figure, click on the Export button in the RStudio Plots tab and Save as PDF.... Change the PDF size to 12 x 14 to give a figure of appropriate size for the text labels.
If you are interested in significant processes that are not among the top five, you can subset your ego dataset to only display these processes:
## Subsetting the ego results without overwriting original `ego` variable
ego2 <- ego
ego2@result <- ego@result[c(1,3,4,8,9),]
## Plotting terms of interest
cnetplot(ego2,
categorySize="pvalue",
foldChange=OE_foldchanges,
showCategory = 5,
vertex.label.font=6)
Functional class scoring tools
Functional class scoring (FCS) tools, such as GSEA , most often use the gene-level statistics or log2 fold changes for all genes from the differential expression results, then look to see whether gene sets for particular biological pathways are enriched among the large positive or negative fold changes.
The hypothesis of FCS methods is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting an arbitrary threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. ==This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes==.
Gene set enrichment analysis using clusterProfiler and Pathview
Using the log2 fold changes obtained from the differential expression analysis for every gene, gene set enrichment analysis and pathway analysis can be performed using clusterProfiler and Pathview tools.
For gene set or pathway analysis using clusterProfiler, coordinated differential expression over gene sets is tested instead of changes of individual genes. “Gene sets are pre-defined groups of genes, which are functionally related. Commonly used gene sets include those derived from KEGG pathways, Gene Ontology terms, MSigDB, Reactome, or gene groups that share some other functional annotations, etc. Consistent perturbations over such gene sets frequently suggest mechanistic changes” .
To perform GSEA analysis of KEGG gene sets, clusterProfiler requires the genes to be identified using Entrez IDs for all genes in our results dataset. We also need to remove the NA values and duplicates (due to gene ID conversion) prior to the analysis:
## Remove any NA values (reduces the data by quite a bit)
res_entrez <- dplyr::filter(res_ids, entrezid != "NA")
## Remove any Entrez duplicates
res_entrez <- res_entrez[which(duplicated(res_entrez$entrezid) == F), ]
Finally, extract and name the fold changes:
## Extract the foldchanges
foldchanges <- res_entrez$log2FoldChange
## Name each fold change with the corresponding Entrez ID
names(foldchanges) <- res_entrez$entrezid
Next we need to order the fold changes in decreasing order. To do this we’ll use the sort() function, which takes a vector as input. This is in contrast to Tidyverse’s arrange(), which requires a data frame.
## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)
head(foldchanges)
Perform the GSEA using KEGG gene sets:
## GSEA using gene sets from KEGG pathways
gseaKEGG <- gseKEGG(geneList = foldchanges, # ordered named vector of fold changes (Entrez IDs are the associated names)
organism = "hsa", # supported organisms listed below
nPerm = 1000, # default number permutations
minGSSize = 20, # minimum gene set size (# genes in set) - change to test more sets or recover sets with fewer # genes
pvalueCutoff = 0.05, # padj cutoff value
verbose = FALSE)
## Extract the GSEA results
gseaKEGG_results <- gseaKEGG@result
NOTE: The organisms with KEGG pathway information are here
How many pathways are enriched? View the enriched pathways:
## Write GSEA results to file
View(gseaKEGG_results)
write.csv(gseaKEGG_results, "results/gseaOE_kegg.csv", quote=F)
NOTE: We will all get different results for the GSEA because the permutations performed use random reordering. If we would like to use the same permutations every time we run a function (i.e. we would like the same results every time we run the function), then we could use the set.seed(123456) function prior to running. The input to set.seed() could be any number, but if you would want the same results, then you would need to use the same number as input.
Explore the GSEA plot of enrichment of one of the pathways in the ranked list:
## Plot the GSEA plot for a single enriched pathway, `hsa03040`
gseaplot(gseaKEGG, geneSetID = 'hsa03040')
Use the Pathview R package to integrate the KEGG pathway data from clusterProfiler into pathway images:
detach("package:dplyr", unload=TRUE) # first unload dplyr to avoid conflicts
## Output images for a single significant KEGG pathway
pathview(gene.data = foldchanges,
pathway.id = "hsa03040",
species = "hsa",
limit = list(gene = 2, # value gives the max/min limit for foldchanges
cpd = 1))
NOTE: Printing out Pathview images for all significant pathways can be easily performed as follows:
## Output images for all significant KEGG pathways
get_kegg_plots <- function(x) {
pathview(gene.data = foldchanges, pathway.id = gseaKEGG_results$ID[x], species = "hsa",
limit = list(gene = 2, cpd = 1))
}
purrr::map(1:length(gseaKEGG_results$ID), get_kegg_plots)
==Instead of exploring enrichment of KEGG gene sets, we can also explore the enrichment of BP Gene Ontology terms using gene set enrichment analysis:==
# GSEA using gene sets associated with BP Gene Ontology terms
gseaGO <- gseGO(geneList = foldchanges,
OrgDb = org.Hs.eg.db,
ont = 'BP',
nPerm = 1000,
minGSSize = 20,
pvalueCutoff = 0.05,
verbose = FALSE)
gseaGO_results <- gseaGO@result
gseaplot(gseaGO, geneSetID = 'GO:0007423')