Brief workflow of RNAseq analysis from Wei

1. General workflow for RNAseq analysis

After standard sequencing process (recommended for the illumina high-throughput sequencing protocol), raw data were obtained and processed with python (higher than 3.7 was recommended) for QC, Trimming, Gene alignment, and Expression count. Subsequently, R (higher than 4.2.0 was recommended) was used for the analysis of differential expressions, and web-based tools were used for annotation / enrichment analysis.

the general workflow for RNAseq analysis

2. Preparations

System requirement

Generally speaking, the linux system (ubuntu or deepin OS) was recommended for the python process, while anaconda or miniconda was recommended for the installation of python.
ubuntu: Enterprise Open Source and Linux | Ubuntu

Deepin OS: deepin.org
miniconda: Anaconda — USTC Mirror Help

Installation of Python environment

After the installation of conda environment, repository mirrors should be added following the instruction:
anaconda | 镜像站使用帮助 | 清华大学开源软件镜像站 | Tsinghua Open Source Mirror

Environment for the bioinformatics should be added to the conda.

### In linux terminal
$ conda create -y --name bioinfo

### Every time you want to do analysis, the environment **bioinfo** should be activated, and software should be installed under the **bioinfo** environment.
$ source activate bioinfo

Every time you want to do analysis, the environment bioinfo should be activated, and software should be installed under the bioinfo environment.

Following softwares should be installed for RNAseq raw-data processing:

$ conda install -c bioconda fastqc trim-galore hisat2 samtools subread

Genomic index download and build

Human and mouse comprehensive gene annotation (GTF) and transcript sequences (FASTA) could be downloaded via the following website:
GENCODE - Home page (gencodegenes.org)

Index could then be built via the following command:

$ hisat2_extract_exons.py genomic_annotation.gtf > genome.exon
$ hisat2_extract_splice_sites.py genomic_annotation.gtf > genome.ss
$ hisat2-build - p 16 --ss genome.ss --exon genome.exon genomic_fasta.fa genomic_annotation
### Building an index with the whole genome requires about 200 GB RAM

Installation of R and packages

R could be downloaded directly from the following website.
R: The R Project for Statistical Computing (r-project.org)
CRAN 镜像使用帮助 — USTC Mirror Help 文档

After the installation, R should be started and the following libraries should be installed.

install.packages("BiocManager", "pacman", "devtools")
### BiocManager is a repository tool for many packages; 
### pacman is a package for easy install and load of other packages;
### devtools is required for the install and use of some packages;

Then, add a ".Rprofile" file in your home directory as follow:

### In linux terminal
$ cd ~/
$ sudo nano .Rprofile
### Add the following commands for quick install and load of R packages:
library(pacman)
options(repos = c(USTC="https://mirrors.ustc.edu.cn/CRAN/"))
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
### Save (ctrl + O) and quit (ctrl + Q)

The following packages were required for the following analysis step:

### In the R enviroment
> p_load(DESeq2, biomaRt, pheatmap, ggplot2, ggrepel, ggsci, org.Hs.eg.db, org.Mm.eg.db, tidyverse, clusterProfiler, GSVA, GSEA, STRINGdb, igraph, ggraph, GSEABase, msigdbr, limma)
### If errors occured during the installation of these packages, please follow the instructions to install dependencies for linux and/or R.

3. Raw data acquisition

General procedure for acquiring RNAseq raw data

4. Python analysis to get count-matrix of the RNAseq

### Quarlity control step
$ fastqc -t 12 -o ./ your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz

### Trimming for adaptors (Not required)
$ trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 --paired your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz

### Alignment step
$ hisat2 -p 12 -x <GRCm39.genome> -1 <sample_id>_R1.fastq.gz -2 <sample_id>_R2.fastq.gz | samtools view -Sb > <sample_id>.bam

### Counting step
$ featureCounts -T 12 -p -t exon -g gene_id -a <gtf.gz> -o <sample_id>.txt <sample_id>.bam

### for Batch running
$ pluma process_batch.sh

### Add the following into process_batch.sh
#!/bin/bash
for i in name_1 _name_2 name_3...
do
  hisat2 -p 12 -x <GRCm39.genome> -1 <sample_id>_R1.fastq.gz -2 <sample_id>_R2.fastq.gz | samtools view -Sb > <sample_id>.bam;
done
files=$(ls *.bam)
featureCounts -T 12 -p -t exon -g gene_id -a <gtf.gz> -o <merged_count>.txt $files
### Save and quit

### Execute the process_batch.sh file as follow in the linux terminal:
$ bash process_batch.sh

5. R process for differential expression analysis

Count-Matrix merging from 'count' for each sample

setwd("<working.dir>)
p_load(tidyverse, biomaRt)
count <- read.table(file.choose(), header = T)
count$Geneid <- gsub("\\..*$", "", count$Geneid)
raw_col <- colnames(count)
raw_col <- gsub("Geneid", "ensembl_gene_id", raw_col)
raw_col <- gsub(".bam.gz", "", raw_col)
raw_col <- gsub("\\.", "_", raw_col)
raw_col <- gsub("_bam", "", raw_col)
colnames(count) <- raw_col
count <- count[,-2:-5]
kb <- count$Length / 1000
rpk <- count[,-1:-2] / kb
tpm <- t(t(rpk)/colSums(rpk) * 1e6)
tpm <- round(tpm, 2)
tpm <- cbind(count[,1:2], tpm)
p_load(biomaRt, tidyverse)
if(grepl("ENSG", count[1,1])){
  genometype = "hsapiens_gene_ensembl"} else {
    genometype = "mmusculus_gene_ensembl"} 
    ### only work for human and/or mouse;
mart <- useMart("ensembl", genometype)
gene_name <- getBM(
  attributes = c("ensembl_gene_id", "external_gene_name"), 
  filters = "ensembl_gene_id", 
  values = count$ensembl_gene_id, 
  mart = mart)
write.csv(gene_name, "GENE_INFO.csv", row.names = F)
count <- merge(gene_name[,1:2], count, by="ensembl_gene_id", sort = F)
count <- count[,-3]
tpm <- merge(gene_name[,1:2], tpm, by = "ensembl_gene_id", sort = F)
tpm <- tpm[,-3]
write.csv(count, "COUNT.csv", row.names = F)
write.csv(tpm, "TPM.csv", row.names = F)
count <- count[!duplicated(count$external_gene_name),]
rownames(count) <- count[,2]
count <- count[,-1:-2]
count <- filter_all(count, any_vars(. > 10))
write.csv(count, "COUNT_MATRIX.csv")
rm(list=ls())

(NOT REQUIRED) Additional manipulation

### batch effect removal
p_load(tidyverse, sva)
batch <- c(rep(1,3),rep(2,3)...)
  # multiple grouping could also be introduced by the group factors
  # group <- c(rep(1,6),rep(2,3),rep(3,3),rep(4,3),rep(5,3),rep(6,3),rep(7,3))
counts <- read.csv("count_matrix.csv", row.names = 1)
adj_counts <- as.data.frame(ComBat_seq(as.matrix(counts), batch = batch, group = group))

DEG analysis
Filtering count value less than 10 resulted in identical DEG analysis results, "COUNT_MATRIX.csv" is already after filtration.

An example for the determination of count filtering

#Preload
p_load(DESeq2, ggplot2, pheatmap, ggrepel, tidyverse, ggsci)
mydata <- read.csv("COUNT_MATRIX.csv", row.names=1)
countData <- mydata[, xx:xx] ### Specify the range of data for analysis

#To build DESeq matrix and filter out genes with count < 10
condition <- factor(gsub("_[0-9]$", "", colnames(countData)))
colData <- data.frame(row.names = colnames(countData), condition)
dds <- DESeqDataSetFromMatrix(countData, colData, design=~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ] 

# For paired-test:
condition <- factor(gsub("_[0-9]$", "", colnames(countData)))
subject <- factor(c(1,2,3,1,2,3))
colData <- data.frame(row.names = colnames(countData), condition, subject)
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~subject +condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ] 

# Draw PCA analysis result
rld <- rlog(dds, blind=T)
p <- plotPCA(rld)
pcaData <- p$data
median <- pcaData %>% 
  group_by(group) %>% 
  summarize(X1=median(PC1), X2 = median(PC2))
xlab_text <- p$labels$x
ylab_text <- p$labels$y
p1 <- ggplot(pcaData, aes(PC1, PC2, color = group)) + 
  geom_point(size = 3) + 
  geom_text_repel(aes(X1, X2, label = group), data = median) + 
  xlab(xlab_text) + ylab(ylab_text) +
  theme_classic()
p1 + theme(legend.position = "none")

# Draw correlation analysis result
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)
pheatmap(rld_cor, border = F, cluster_rows = F, cluster_cols = F, angle_col = 45, 
  legend = F) 

# Draw DEG analysis result
dds <- DESeq(dds)
res <- results(dds)
resdata <- as.data.frame(res)
resdata$change = ifelse(resdata$padj < 0.05 & abs(resdata$log2FoldChange) >= 1, 
  ifelse(resdata$log2FoldChange>=1, 'Up', 'Down'), 'Stable')
resdata <- na.omit(resdata)
### resdata_sig <- subset(resdata, change != "Stable")
### resdata_sig <- resdata_sig %>% top_n(10, abs(log2FoldChange))
ggplot(resdata, aes(log2FoldChange, -log(padj, 10), colour = change)) + 
  geom_point(alpha=0.8, size = 3) + 
  scale_color_manual(values=c("blue", "grey","red")) + 
  xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"FDR")) + 
  geom_vline(xintercept=c(-1,1),lty=4,col="lightgrey",lwd=0.8) + 
  geom_hline(yintercept = -log10(0.05),lty=4,col="lightgrey",lwd=0.8)  +
  ### geom_text_repel(aes(label=rownames(resdata_sig)), data=resdata_sig) +
  theme_classic()
table(resdata$change)

# Save DEG analysis results
resdata <- as.data.frame(results(dds))
write.csv(resdata, file="DESeq2_full_results.csv")
res <- subset(res, padj < 0.05 & (log2FoldChange >1 | log2FoldChange < -1))
resdata <- as.data.frame(res)
write.csv(resdata, file="DESeq2_DEGs_results.csv")

Color patterns usually used for 'ggplot2' and 'pheatmap'

----------cbPalette--------------------------------cbbPalette----------

# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", 
  "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", 
  "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# The usually used colors:
color = c("red", "blue", "violet", "orange", "cyan", "brown", "navy", "firebrick3")

Correlation and PCA plot
Alternatively, correlation, PCA, and vocano plots could be draw directly from the dds matrix as follows:

p_load(FactoMineR, corrplot)
expr <- read.csv("tpm_nominated.csv", header=T, row.names=1)
corr <- cor(expr, method='pearson')
corrplot(corr, type='upper', tl.col='black', order='hclust', tl.srt=45, addCoef.col = 'white')
expr <- t(expr)
gene.pca <- PCA(expr, ncp=2, scale.unit=TRUE, graph=FALSE)
plot(gene.pca)

Plot the Upset diagram

p_load(UpSetR)
mydata <- read.csv("<DEGs.csv>") 
listInput <- list(group_1 = mydata$group_1, group_2 = mydata$group_2, ...)
result <- upset(fromList(listInput), sets = c("mock"...), 
  keep.order = T, number.angles = 30, point.size = 3.5, 
  line.size = 2, mainbar.y.label = "Intersections", sets.x.label = "DEG numbers", 
  text.scale = c(1.3, 1.3, 1, 1, 2, 1), mb.ratio = c(0.55, 0.45))

6. Geneset annotation or enrichment analysis

Enrichment analysis could be performed directly from the following websites:
Overview: DAVID: Functional Annotation Tools (ncifcrf.gov)
Web GSEA: WEB-based GEne SeT AnaLysis Toolkit
Gene annotation and analysis: Metascape
Results were more user-friendly and easier to be achieved than R did, thus these websites were highly recommended.
Alternatively, the following R scripts could also be used with many variations.
Genesets information are directly downloaded from MSigDB database via the msigdbr packages.
Detailed information is here: GSEA | MSigDB (gsea-msigdb.org)

List of subsets from the msigdbr database

GSEA analysis

p_load(tidyverse, GSVA, GSEABase, clusterProfiler, msigdbr, enrichplot, GseaVis)
KEGG_df = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CP:KEGG") %>% 
  dplyr::select(gs_name, gene_symbol)
log2FC_matrix <- read.csv(file.choose()) 
log2FC_matrix <- na.omit(log2FC_matrix)
# The matrix contains external_gene_name and log2FC value
GSEA <- log2FC_matrix$log2FoldChange
names(GSEA) <- log2FC_matrix$X
GSEA <- sort(GSEA, decreasing = T)
GSEA_kegg <- GSEA(GSEA, TERM2GENE = KEGG_df, eps = 1e-10)
dotplotGsea(data = GSEA_kegg,topn = 10, order.by = 'NES', add.seg = T, base_size = 10)
write.csv(GSEA_kegg@result, "GSEA_result.csv")
gseaplot2(GSEA_kegg, geneSetID = 1:3, pvalue_table = TRUE, 
  color = c("#E495A5", "#86B875", "#7DB0DD"))
volcanoGsea(GSEA_kegg)

GO and KEGG Enrichment analysis

p_load(org.Hs.eg.db, clusterProfiler, ggplot2)
DEGs_entrezid <- mapIds(x=org.Hs.eg.db, column = "ENTREZID", 
  keys = as.character(mydata$DEGs), keytype = "ENSEMBL")
ALL <- enrichGO(gene=DEGs_entrezid, OrgDb=org.Hs.eg.db, 
  keyType = "ENTREZID", ont = 'ALL', pvalueCutoff = 0.01, 
  pAdjustMethod = "BH", qvalueCutoff = 0.05, readable=T)
barplot(ALL, split="ONTOLOGY", label_format=100) + 
  facet_grid(ONTOLOGY~.,scale="free")
write.table(as.data.frame(ALL), 'GO_enrichment.txt', sep = '\t', row.names = F, quote = F)

my_entrez_id <- mapIds(x=org.Hs.eg.db, column = "ENTREZID", 
  keys = as.character(mydata$down_genes), keytype = "ENSEMBL")
erich.kegg = enrichKEGG(gene = my_entrez_id, organism = "hsa", pAdjustMethod = 'BH', 
  pvalueCutoff = 0.01, qvalueCutoff = 0.05)
dotplot(erich.kegg)
write.table(as.data.frame(erich.kegg), 'KEGG_enrichment.txt', sep = '\t', row.names = F, quote = F)

7. GSVA analysis for analyzing geneset-level activities of multiple samples

count <- read.csv("count_matrix.csv", row.names = 1)
countData <- filter_all(count, any_vars(. > 10))

p_load(tidyverse, GSVA, GSEABase, clusterProfiler, msigdbr, limma, pheatmap, ggplot2, GSEABase, msigdbr, sva)
KEGG_df_all <-  msigdbr(species = "Mus musculus", category = "H") 
KEGG_df <- dplyr::select(KEGG_df_all,gs_name,gs_exact_source,gene_symbol)
kegg_list <- split(KEGG_df$gene_symbol, KEGG_df$gs_name)
GSVA_hall <- gsva(expr = as.matrix(count), gset.idx.list = kegg_list, 
  mx.diff = T, kcdf = "Poisson", parallel.sz = 16)
# kcdf should be set into "Gaussian" for CPM, RPKM, and TPM, while "Poisson" for read counts.
# "BiocParallel" package should be installed first for parallel calculation.
pheatmap(GSVA_hall, border = F, cluster_cols = F)

# "limma" package for the differential analysis of genesets.
list <- c(rep("CK", 3), rep("Treat",3)) %>% factor(., levels = c("CK", "Treat"), ordered = F)
list <- model.matrix(~factor(list)+0)
colnames(list) <- c("CK", "Treat")
df.fit <- lmFit(GSVA, list)
df.matrix <- makeContrasts(Treat - CK , levels = list)
fit <- contrasts.fit(df.fit, df.matrix)
fit <- eBayes(fit)
tempOutput <- topTable(fit,n = Inf, adjust = "fdr")
tempOutput$change <- ifelse(tempOutput$P.Value < 0.05 & abs(tempOutput$logFC) >= 1, 
  ifelse(tempOutput$logFC >= 1, 'Up', 'Down'), 'Stable')
tempOutput <- na.omit(tempOutput)
ggplot(tempOutput, aes(logFC, -log10(P.Value), colour = change)) + 
  geom_point(alpha=0.8, size = 3) + 
  scale_color_manual(values=c("blue","grey","red")) + 
  xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"FDR")) + 
  geom_vline(xintercept=c(-1,1), lty=4, col="black", lwd=0.8) + 
  geom_hline(yintercept = -log10(0.05), lty=4, col="black", lwd=0.8)  +
  theme(legend.position = "none")
table(tempOutput$change)
tempOutput <- tempOutput[(tempOutput$change != "Stable"),]
write.csv(tempOutput, "GSVA_sig_xxx.csv")

"TCseq" could be used for clustering of GSVA results or "limma" could be used for differential analysis thereafter.

8. PPI analysis and graphing

p_load(tidyverse, clusterProfiler, org.Hs.eg.db, STRINGdb, igraph, ggraph)
string_db <- STRINGdb$new(version="11.5", species = 9606, score_threshold = 400)  
# 9606 for human and 10090 for mouse;
DEGs <- read.csv("DESeq2_results.csv", row.names = 1)
DEGs <- DEGs[(abs(DEGs$log2FoldChange) >= 1.5),]
gene <- rownames(DEGs)
gene <- gene %>% bitr(fromType = "SYMBOL", toType = "ENTREZID", 
  OrgDb = "org.Hs.eg.db", drop = T)
data_mapped <- gene %>% string_db$map(my_data_frame_id_col_names = "ENTREZID", 
  removeUnmappedRows = TRUE)
string_db$plot_network(data_mapped$STRING_id)

data_links <- data_mapped$STRING_id[1:100] %>% string_db$get_interactions()
links <- data_links %>%
  mutate(from = data_mapped[match(from, data_mapped$STRING_id), "SYMBOL"]) %>% 
  mutate(to = data_mapped[match(to, data_mapped$STRING_id), "SYMBOL"]) %>%  
  dplyr::select(from, to , last_col()) %>% 
  dplyr::rename(weight = combined_score)
nodes <- links %>% { data.frame(gene = c(.$from, .$to)) } %>% distinct()
net <- igraph::graph_from_data_frame(d=links,vertices=nodes,directed = F)
igraph::V(net)$deg <- igraph::degree(net) 
igraph::V(net)$size <- igraph::degree(net)/5 
igraph::E(net)$width <- igraph::E(net)$weight/10
ggraph(net,layout = "kk")+
  geom_edge_arc(aes(edge_width=width), color = "red4", show.legend = F)+
  geom_node_point(aes(size=size), color="blue2", alpha=0.7)+
  geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = T)+
  scale_edge_width(range = c(0.2,1))+
  scale_size_continuous(range = c(1,10) )+
  guides(size=F)+
  theme_graph()
# ggraph(net,layout = "stress")+
# ggraph(net,layout = "linear", circular = TRUE)+


links_2 <- links %>% mutate(from_c = count(., from)$n[match(from, count(., from)$from)]) %>%
  mutate(to_c = count(., to)$n[match(to, count(., to)$to)]) %>%
  filter(!(from_c == 1 & to_c == 1)) %>%
  dplyr::select(1,2,3)
nodes_2 <- links_2 %>% { data.frame(gene = c(.$from, .$to)) } %>% distinct()
net_2 <- igraph::graph_from_data_frame(d=links_2,vertices=nodes_2,directed = F)
igraph::V(net_2)$deg <- igraph::degree(net_2)
igraph::V(net_2)$size <- igraph::degree(net_2)/5
igraph::E(net_2)$width <- igraph::E(net_2)$weight/10

ggraph(net_2,layout = "centrality", cent = deg)+
  geom_edge_fan(aes(edge_width=width), color = "lightblue", show.legend = F)+
  geom_node_point(aes(size=size), color="orange", alpha=0.7)+
  geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = T)+
  scale_edge_width(range = c(0.2,1))+
  scale_size_continuous(range = c(1,10) )+
  guides(size=F)+
  theme_graph()

links_2 %>% tidygraph::as_tbl_graph() %>%
  ggraph(layout = "kk")+
  geom_edge_fan(color = "grey")+
  geom_node_point(size=5, color="blue", alpha=0.8)+
  geom_node_text(aes(label=name), repel = T)+
  theme_void()