- 下载ubuntu桌面版本
https://ubuntu.com/download - VMware for windows 下载
https://my.vmware.com/web/vmware/details?downloadGroup=WKST-1510-WIN&productId=799&rPId=33369
官方需要注册,所以我在这个完整上找了个不是最新版本的进行下载:
https://www.cr173.com/soft/68480.html - 安装步骤主要参考以下这篇教程
如何在Windows环境下安装Linux系统虚拟机
安装过程
登录界面
安装git
sudo apt-get install git
安装miniconda
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash /Miniconda3-latest-Linux-x86_64.sh
#添加conda channels
conda config --add channels bioconda
conda config --add channels conda-forge
#显示安装的channels
conda config --get channels
创建新的conda环境变量
#利用git clone命令将之前的snakemake pipline克隆到本地服务器
git clone https://github.com/fanqiang1024/RNA-seq_snakemake.git ./
Cloning into '.'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (36/36), done.
Unpacking objects: 60% (28/46)
remote: Total 46 (delta 17), reused 29 (delta 8), pack-reused 0
Unpacking objects: 100% (46/46), done.
ls
config.yaml dag.svg merge.R pairEndSnakemake
README.md snakefileR snakemake.yaml v3.11.0.tar.bz2
剩下的可以根据之前的教程进行snakemake分析环境的创建
conda环境变量设置与转移
利用snakemake构建分析流程
参考文章:
Writing a RNA-Seq workflow with snakemake
1.下载数据
cd samples/raw
for i in control_R1 control_R2 treated_R1 treated_R2;
do
wget --no-clobber http://pedagogix-tagc.univ-mrs.fr/courses/data/ngs/abd/${i}.fq.gz
2>/dev/null;
done
2.构建19号染色体的参考基因组索引
mkdir -p projects/indexes/bowtie2/mm10 # Create a directory to store the index
cd projects/indexes/bowtie2/mm10 # go to projects/indexes/bowtie2/mm10
# Download chr19 sequence (mm10 version)
wget --no-clobber http://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/chr19.fa.gz 2> chr19.fa_wget.log
gunzip chr19.fa.gz # uncompress the file
# to get help about bowtie-build type:
# bowtie2-build -h
# To uncomment
bowtie2-build chr19.fa chr19 &> chr19.fa_bowtie2-build.log
3.下载gtf文件
cd projects
mkdir -p annotations/gtf
cd annotations/gtf
wget ftp://ftp.ensembl.org/pub/release-83/gtf/mus_musculus/Mus_musculus.GRCm38.83.chr.gtf.gz | gunzip -c | grep "^19" | sed 's/^19/chr19/' > GRCm38.83.chr19.gtf
4.在 UCSC web site 下载染色体信息
Clade: Mammal.
Genome: Mouse.
Assembly: mm10.
Group: All Tables.
Table: chromInfo.
Select all fields from selected table for output. Use the following string “chromInfo_mm10.txt” as output file name.
Click on get output
Create a directory projects/annotations/txt and place the downloaded file in that directory.
5. 创建Snakefile文件
BASE_DIR = "/Users/projects"
INDEX = BASE_DIR + "/indexes/bowtie2/mm10/chr19"
GTF = BASE_DIR + "/annotations/gtf/GRCm38.83.chr19.gtf"
CHR = BASE_DIR + "/annotations/txt/chromInfo_mm10.txt"
FASTA = BASE_DIR + "/indexes/bowtie2/mm10/chr19.fa"
#录入样本信息
SAMPLES, = glob_wildcards("samples/raw/{smp}_R1.fq.gz")
NB_SAMPLES = len(SAMPLES)
for smp in SAMPLES:
message("Sample " + smp + " will be processed")
#第一个rule,对rawdata进行trimming
rule final:
input: expand("samples/trimmed/{smp}_R1_t.fq", smp=SAMPLES)
rule trimming:
input: fwd="samples/raw/{smp}_R1.fq.gz", rev="samples/raw/{smp}_R2.fq.gz"
output: fwd="samples/trimmed/{smp}_R1_t.fq",
rev="samples/trimmed/{smp}_R2_t.fq",
single="samples/trimmed/{smp}_R1_singletons.fq"
message: """--- Trimming."""
shell: """
sickle pe -f {input.fwd} -r {input.rev} -l 25 -q 20 -t sanger -o {output.fwd} -p {output.rev} -s {output.single} &> {input.fwd}.log
"""
6. 增加质控rule
rule final:
input: expand("samples/fastqc/{smp}/{smp}_R1_t.fq_fastqc.zip", smp=SAMPLES)
rule trimming:
input: fwd="samples/raw/{smp}_R1.fq.gz", rev="samples/raw/{smp}_R2.fq.gz"
output: fwd="samples/trimmed/{smp}_R1_t.fq",
rev="samples/trimmed/{smp}_R2_t.fq",
single="samples/trimmed/{smp}_R1_singletons.fq"
message: """--- Trimming reads."""
shell: """
sickle pe -f {input.fwd} -r {input.rev} -l 40 -q 20 -t sanger -o {output.fwd} -p {output.rev} -s {output.single} &> {input.fwd}.log
"""
rule fastqc:
input: fwd="samples/trimmed/{smp}_R1_t.fq",
rev="samples/trimmed/{smp}_R2_t.fq"
output: fwd="samples/fastqc/{smp}/{smp}_R1_t.fq_fastqc.zip", rev="samples/fastqc/{smp}/{smp}_R2_t.fq_fastqc.zip"
message: """--- Quality check of raw data with Fastqc."""
shell: """
fastqc --outdir samples/fastqc/{wildcards.smp} --extract -f fastq {input.fwd} {input.rev}
"""
7. 对数据进行mapping
rule tophat:
input:fwd="samples/trimmed/{smp}_R1_t.fq",
rev="samples/trimmed/{smp}_R2_t.fq"
params: gtf=GTF, index=INDEX
output: "samples/bam/{smp}.bam"
shell: """
mkdir -p samples/tophat/{wildcards.sample}
tophat2 \
-o samples/tophat/{wildcards.sample} \
-g 1 \
--library-type fr-firststrand \
-G {params.gtf} \
-x 1 \
-p 5 \
{params.index} \
{input.fwd} {input.rev} &> samples/tophat/{wildcards.sample}/run_tophat.log
cd samples/tophat/{wildcards.sample}
mv accepted_hits.bam ../../bam/{wildcards.smp}.bam
"""
8.寻找novel转录本
rule cufflinks:
input: bam="samples/bam/{smp}.bam"
output: gtf="samples/cufflinks/{smp}/transcripts.gtf"
params: gtf=GTF
message: "--- Searching novel transcript with cufflinks."
shell: """
cufflinks -g {params.gtf} -p 5 --library-type fr-firststrand -o samples/cufflinks/{wildcards.smp} {input.bam} &> {output}.log
"""
9. 将novel转录本与已知的进行比对
10. 合并novel和已知的转录本和基因名
11. 利用featureCounts进行定量
也可以使用htseq-count进行定量
rule quantification_with_featureCounts:
input: novel="samples/new_annotation/all_transcripts.gtf", bam=expand("samples/bam/{smp}.bam", smp=SAMPLES)
output: "results/counts/gene_counts.txt", "results/counts/gene_counts_mini.txt"
shell: """
featureCounts -p -s 2 -T 15 -t exon -g gene_id -a {input.novel} -o {output[0]} {input.bam} &> {output[0]}.log
cut -f 1,7- {output[0]}| awk 'NR > 1' | awk '{{gsub("samples/bam/","",$0); print}}' > {output[1]}
"""
12. R语言作图展示
rule diagnostic_plot:
input: "results/counts/gene_counts_mini.txt"
output: "results/diagnostic_plot/diagnostic.pdf"
run: R("""
dir.create("results/diagnostic_plot")
data <- read.table("{input}",
sep="\t",
header=T,
row.names=1)
data <- data[rowSums(data) > 0, ]
data <- log2(data + 1)
pdf("{output}")
dev.null <- apply(data, 2, hist, border="white", col="blue")
boxplot(data, color="blue", pch=16)
pairs(data, pch=".", col="blue")
dev.off()
cat("etc...")
""")