Hisat2+Stringtie+DESeq2 workflow in R

前言

因为以前的分析流程要在Shell 和 R 之间切换,然后可控性还差,有点烦,正好最近有点数据要分析,干脆就重新建立一个 all in one 的流程,方便以后使用。

运行环境

Ubuntu 18.04
R 3.6.1

用到的软件

fastp
bowtie2
hisat2
stringtie
DESeq2

Refference data

rRNA index:用bowtie2去除fastq文件中的rRNA
tRNA index:http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-tRNAs.fa
Hisat2 index:https://cloud.biohpc.swmed.edu/index.php/s/grch38_tran/download
GRCh38.96.gtf:ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
Homo_sapiens.GRCh38.dna.primary_assembly.fa:ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

代码

############## Functions #####################
#Hisat2 参数
grch38_tran_index = "/home/zhou/RNASEQ/index/grch38_tran/genome_tran"
rRNA_index = "/home/zhou/RNASEQ/index/rRNA_index/human_rRNA"
tRNA_index = "/home/zhou/RNASEQ/index/tRNA_index_hg38/hg38_tRNA"
GRCh38.96.gtf = "/home/zhou/RNASEQ/index/grch38_tran/Homo_sapiens.GRCh38.96.gtf"
fasta = "/home/zhou/RNASEQ/index/grch38_tran/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

#fastp preprocess
##Single version
fastp_SE <- function(input_fastq = input_fastq) {
  fastp_PATH = "/home/zhou/anaconda2/bin/fastp"
  output_fastq = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq),"_fastpedited.fastq.gz")
  fastp_cmd = paste(fastp_PATH, "-i", input_fastq, "-o", output_fastq, "-w 16", "-z 9")
  system(fastp_cmd,wait = T)
  return(output_fastq)
}
##Paired version
fastp_PE <- function(input_fastq_F = input_fastq_F, input_fastq_R = input_fastq_R) {
  fastp_PATH = "/home/zhou/anaconda2/bin/fastp"
  output_fastq_F = paste0(gsub(pattern = "\\.fq\\.gz",replacement = "",x = input_fastq_F),"_fastpedited.fastq.gz")
  output_fastq_R = paste0(gsub(pattern = "\\.fq\\.gz",replacement = "",x = input_fastq_R),"_fastpedited.fastq.gz")
  fastp_cmd = paste(fastp_PATH, "-i", input_fastq_F, "-o", output_fastq_F, "-I", input_fastq_R, "-O", output_fastq_R, "-w 16", "-z 9")
  system(fastp_cmd,wait = T)
  return(c(output_fastq_F,output_fastq_R))
}

#rRNA remove
##Single version
remove_rRNA_SE <- function(input_fastq = input_fastq, index = index) {
  bowtie2_PATH = "/home/zhou/anaconda2/bin/bowtie2"
  output_fastq = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq),"_rmrRNA.fastq.gz")
  bowtie2_cmd = paste(bowtie2_PATH, "-x", index, "--un-gz", output_fastq, "-U", input_fastq, "-p 20 -S tmp.sam")
  system(bowtie2_cmd,wait = T)
  system("rm tmp.sam")
  return(output_fastq)
}

##Paired version
remove_rRNA_PE <- function(input_fastq_F = input_fastq_F, input_fastq_R = input_fastq_R, index = index) {
  bowtie2_PATH = "/home/zhou/anaconda2/bin/bowtie2"
  output_fastq = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_F),"_rmrRNA.fastq.gz")
  output_fastq_F = gsub(pattern = "\\.fastq\\.gz",replacement = "\\.fastq\\.1\\.gz",x = output_fastq)
  output_fastq_R = gsub(pattern = "\\.fastq\\.gz",replacement = "\\.fastq\\.2\\.gz",x = output_fastq)
  output_fastq_F_reset = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_F),"_rmrRNA.fastq.gz")
  output_fastq_R_reset = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_R),"_rmrRNA.fastq.gz")
  bowtie2_cmd = paste(bowtie2_PATH, "-x", index, "--un-conc-gz", output_fastq, "-1", input_fastq_F, "-2", input_fastq_R, "-p 20 -S tmp.sam")
  rename_cmd_F = paste("mv", output_fastq_F, output_fastq_F_reset)
  rename_cmd_R = paste("mv", output_fastq_R, output_fastq_R_reset)
  system(bowtie2_cmd,wait = T)
  system("rm tmp.sam",wait = T)
  system(rename_cmd_F,wait = T)
  system(rename_cmd_R,wait = T)
  return(c(output_fastq_F_reset,output_fastq_R_reset))
}
#tRNA remove
##Single version
remove_tRNA_SE <- function(input_fastq = input_fastq, index = index) {
  bowtie2_PATH = "/home/zhou/anaconda2/bin/bowtie2"
  output_fastq = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq),"_rmtRNA.fastq.gz")
  bowtie2_cmd = paste(bowtie2_PATH, "-x", index, "--un-gz", output_fastq, "-U", input_fastq, "-p 20 -S tmp.sam")
  system(bowtie2_cmd,wait = T)
  system("rm tmp.sam")
  return(output_fastq)
}

##Paired version
remove_tRNA_PE <- function(input_fastq_F = input_fastq_F, input_fastq_R = input_fastq_R, index = index) {
  bowtie2_PATH = "/home/zhou/anaconda2/bin/bowtie2"
  output_fastq = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_F),"_rmtRNA.fastq.gz")
  output_fastq_F = gsub(pattern = "\\.fastq\\.gz",replacement = "\\.fastq\\.1\\.gz",x = output_fastq)
  output_fastq_R = gsub(pattern = "\\.fastq\\.gz",replacement = "\\.fastq\\.2\\.gz",x = output_fastq)
  output_fastq_F_reset = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_F),"_rmtRNA.fastq.gz")
  output_fastq_R_reset = paste0(gsub(pattern = "\\.fastq\\.gz",replacement = "",x = input_fastq_R),"_rmtRNA.fastq.gz")
  bowtie2_cmd = paste(bowtie2_PATH, "-x", index, "--un-conc-gz", output_fastq, "-1", input_fastq_F, "-2", input_fastq_R, "-p 20 -S tmp.sam")
  rename_cmd_F = paste("mv", output_fastq_F, output_fastq_F_reset)
  rename_cmd_R = paste("mv", output_fastq_R, output_fastq_R_reset)
  system(bowtie2_cmd,wait = T)
  system("rm tmp.sam",wait = T)
  system(rename_cmd_F,wait = T)
  system(rename_cmd_R,wait = T)
  return(c(output_fastq_F_reset,output_fastq_R_reset))
}
#hisat2 
##Single version
hisat2_SE <- function(index = index, input_fastq = input_fastq, output_prefix = output_prefix) {
  bam_filename = paste0(output_prefix,".bam")
  hisat2_PATH = "/home/zhou/bin/hisat2"
  hisat2_cmd = paste(hisat2_PATH, "--dta -x",index, "-p 20 -U", input_fastq, " | samtools sort -@ 20 -o", bam_filename)
  system(hisat2_cmd,wait = T)
  return(bam_filename)
}

##Paired version
hisat2_PE <- function(index = index, input_fastq_F = input_fastq_F,input_fastq_R = input_fastq_R, output_prefix = output_prefix) {
  bam_filename = paste0(output_prefix,".bam")
  hisat2_PATH = "/home/zhou/bin/hisat2"
  hisat2_cmd = paste(hisat2_PATH, "--dta -x",index, "-p 20 -1", input_fastq_F, "-2", input_fastq_R, " | samtools sort -@ 20 -o", bam_filename)
  system(hisat2_cmd,wait = T)
  return(bam_filename)
}

#stringtie
stringtie <- function(input_bam = input_bam, gtf = gtf) {
  output_prefix = gsub(pattern = "\\.bam",replacement = "",x = input_bam)
  tab_filename = paste0(output_prefix,".tab")
  gtf_filename = paste0(output_prefix,".gtf")
  stringtie_PATH = "/home/zhou/bin/stringtie"
  stringtie_cmd = paste(stringtie_PATH,"-p 20 -G", gtf, "-o", gtf_filename, "-e -l", output_prefix, input_bam)
  system(stringtie_cmd,wait = T)
  return(gtf_filename)
}

#generate gene\transcript count matrix
run.prepDE.py <- function(inputfile = inputfile) {
  run.cmd = paste("python ~/Software/prepDE.py -i",inputfile)
  system(run.cmd,wait = T)
}
################ test data ############################
setwd("~/ribosome/mTOR_Liver_cell/testdata/")
sample_pattern = unique(gsub(x = list.files(path = "./"), pattern = "_R1_test.fq.gz|_R2_test.fq.gz",replacement = ""))

for (i in sample_pattern) {
  print(i)
  ### fastp preprocess
  fastp.fastq = fastp_PE(input_fastq_F = paste0(i,"_R1_test.fq.gz"),input_fastq_R = paste0(i,"_R2_test.fq.gz"))
  ### remove rRNA
  rmrRNA.fastq = remove_rRNA_PE(input_fastq_F = fastp.fastq[1],input_fastq_R = fastp.fastq[2],index = rRNA_index)
  ### remove tRNA
  rmtRNA.fastq = remove_tRNA_PE(input_fastq_F = rmrRNA.fastq[1],input_fastq_R = rmrRNA.fastq[2],index = tRNA_index)
  ### hisat2 mapping
  hisat2.bam = hisat2_PE(index = grch38_tran_index,input_fastq_F = rmtRNA.fastq[1],input_fastq_R = rmtRNA.fastq[2],output_prefix = i )
  ### stringtie assemble
  stringtie(input_bam = hisat2.bam,gtf = GRCh38.96.gtf)
}


################ real data ############################
#rnaseq
setwd("~/ribosome/mTOR_Liver_cell/RNA-Rawdata/")
sample_pattern = unique(gsub(x = list.files(path = "./"), pattern = "_R1.fq.gz|_R2.fq.gz",replacement = ""))

for (i in sample_pattern) {
  print(i)
  ### fastp preprocess
  fastp.fastq = fastp_PE(input_fastq_F = paste0(i,"_R1.fq.gz"),input_fastq_R = paste0(i,"_R2.fq.gz"))
  ### remove rRNA
  rmrRNA.fastq = remove_rRNA_PE(input_fastq_F = fastp.fastq[1],input_fastq_R = fastp.fastq[2],index = rRNA_index)
  ### remove tRNA
  rmtRNA.fastq = remove_tRNA_PE(input_fastq_F = rmrRNA.fastq[1],input_fastq_R = rmrRNA.fastq[2],index = tRNA_index)
  ### hisat2 mapping
  hisat2.bam = hisat2_PE(index = grch38_tran_index,input_fastq_F = rmtRNA.fastq[1],input_fastq_R = rmtRNA.fastq[2],output_prefix = i )
  ### stringtie assemble
  stringtie(input_bam = hisat2.bam,gtf = GRCh38.96.gtf)
}

## generate counts matrix
mergelist = data.frame(sample_pattern,paste0("./",sample_pattern,".gtf"))
write.table(mergelist,file = "mergelist", sep = "\t",row.names = F,col.names = F,quote = F)
run.prepDE.py(inputfile = "mergelist")

## DEA
### load count_matrix
gene_matrix = read.csv(file = "gene_count_matrix.csv",header = T,row.names = 1)

library(DESeq2)
library(BiocParallel)
register(MulticoreParam(24))

database = gene_matrix
condition <- factor(c(rep("T",3),rep("UT",3)))
coldata <- data.frame(row.names = colnames(database), condition)
dds <- DESeqDataSetFromMatrix(countData=database, colData=coldata, design=~condition)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
dds <- DESeq(dds,parallel = T)
res <- results(dds)
summary(res) 
table(res$padj<0.1)
res <- res[order(res$padj),]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),by="row.names",sort=FALSE)
total.RNA.signresdata<-na.omit(resdata[resdata$padj<0.1 & abs(resdata$log2FoldChange) > 1,])

#riboseq
setwd("~/ribosome/mTOR_Liver_cell/RIBO-Rawdata/")
sample_pattern = unique(gsub(x = list.files(path = "./"), pattern = "_R1.fq.gz|_R2.fq.gz",replacement = ""))

for (i in sample_pattern) {
  print(i)
  ### fastp preprocess
  fastp.fastq = fastp_PE(input_fastq_F = paste0(i,"_R1.fq.gz"),input_fastq_R = paste0(i,"_R2.fq.gz"))
  ### remove rRNA
  rmrRNA.fastq = remove_rRNA_PE(input_fastq_F = fastp.fastq[1],input_fastq_R = fastp.fastq[2],index = rRNA_index)
  ### remove tRNA
  rmtRNA.fastq = remove_tRNA_PE(input_fastq_F = rmrRNA.fastq[1],input_fastq_R = rmrRNA.fastq[2],index = tRNA_index)
  ### hisat2 mapping
  hisat2.bam = hisat2_PE(index = grch38_tran_index,input_fastq_F = rmtRNA.fastq[1],input_fastq_R = rmtRNA.fastq[2],output_prefix = i )
  ### stringtie assemble
  stringtie(input_bam = hisat2.bam,gtf = GRCh38.96.gtf)
}

## generate counts matrix
mergelist = data.frame(sample_pattern,paste0("./",sample_pattern,".gtf"))
write.table(mergelist,file = "mergelist", sep = "\t",row.names = F,col.names = F,quote = F)
run.prepDE.py(inputfile = "mergelist")

## DEA
### load count_matrix
gene_matrix = read.csv(file = "gene_count_matrix.csv",header = T,row.names = 1)

library(DESeq2)
library(BiocParallel)
register(MulticoreParam(24))

database = gene_matrix
condition <- factor(c(rep("T",3),rep("UT",3)))
coldata <- data.frame(row.names = colnames(database), condition)
dds <- DESeqDataSetFromMatrix(countData=database, colData=coldata, design=~condition)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
dds <- DESeq(dds,parallel = T)
res <- results(dds)
summary(res) 
table(res$padj<0.1)
res <- res[order(res$padj),]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),by="row.names",sort=FALSE)
Ribo.RNA.signresdata<-na.omit(resdata[resdata$padj<0.1 & abs(resdata$log2FoldChange) > 1,])

结语

把所有功能都模块化了,真舒服啊!
此外,有些function可以增加参数来进一步增加自由度,比如说线程数,我懒得写了,其实可以慢慢完善的,以后看心情把,毕业季实在太忙了。。。

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 194,457评论 5 459
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 81,837评论 2 371
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 141,696评论 0 319
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 52,183评论 1 263
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 61,057评论 4 355
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 46,105评论 1 272
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 36,520评论 3 381
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 35,211评论 0 253
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 39,482评论 1 290
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 34,574评论 2 309
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 36,353评论 1 326
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,213评论 3 312
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 37,576评论 3 298
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 28,897评论 0 17
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,174评论 1 250
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 41,489评论 2 341
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 40,683评论 2 335

推荐阅读更多精彩内容