RNAseq全流程

RNAseq全流程

1. hisat2比对

2.进行bam文件排序和生成索引

3.进行基因计数

4. Deseq分析绘图

环境安装:

conda create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
conda install bwa
conda install samtools
conda install -c bioconda hisat2
conda install bcftools
conda install -c bioconda spades
conda install -c bioconda iva
conda install -c bioconda quast
conda install -c bioconda bandage
conda install subread

1. hisat2比对

/home/sam/bin/hisat2 -p 16 -x "/mnt/d/big_data/RNAseq/hjy241226/mm10/genome" -1 "./C021/C021_R1.fq.gz" -2 "./C021/C021_R2.fq.gz" | samtools view -Sb - > "./result_bam/C021.bam"
解释代码
hisat2 -p 16 -x 基因组路径前缀 -1 *_R1.fq.gz -2 *_R2.fq.gz | samtools view -Sb - >
*.bam
通过管道符,跳过sam文件直接生成bam文件(sam~30GB 而bam~6GB)
-p 指定线程,建议10,或16 3+3GB ,约15min生成6GB左右的.bam文件
-1,-2 双端测序

conda activate snakemake
cd /mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata
/home/sam/bin/hisat2 -p 16 -x "/mnt/d/big_data/RNAseq/hjy241226/mm10/genome" -1 "./C021/C021_R1.fq.gz" -2 "./C021/C021_R2.fq.gz"  | samtools view -Sb - > "./result_bam/C021.bam"
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# /home/sam/bin/hisat2 -p 16 -x "/mnt/d/big_data/RNAseq/hjy241226/mm10/genome" -1 "./C021/C021_R1.fq.gz" -2 "./C021/C021_R2.fq.gz"  | samtools view -Sb - > "./result_bam/C021.bam"
43726462 reads; of these:
  43726462 (100.00%) were paired; of these:
    5229770 (11.96%) aligned concordantly 0 times
    36308824 (83.04%) aligned concordantly exactly 1 time
    2187868 (5.00%) aligned concordantly >1 times
    ----
    5229770 pairs aligned concordantly 0 times; of these:
      234461 (4.48%) aligned discordantly 1 time
    ----
    4995309 pairs aligned 0 times concordantly or discordantly; of these:
      9990618 mates make up the pairs; of these:
        7670921 (76.78%) aligned 0 times
        2129401 (21.31%) aligned exactly 1 time
        190296 (1.90%) aligned >1 times
91.23% overall alignment rate

文件树视图如下:

(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# cat tree.txt
./
├── C021
│   ├── C021_R1.fq.gz
│   └── C021_R2.fq.gz
├── C022
│   ├── C022_R1.fq.gz
│   └── C022_R2.fq.gz
├── C023
│   ├── C023_R1.fq.gz
│   └── C023_R2.fq.gz
├── C024
│   ├── C024_R1.fq.gz
│   └── C024_R2.fq.gz
├── C025
│   ├── C025_R1.fq.gz
│   └── C025_R2.fq.gz
├── C026
│   ├── C026_R1.fq.gz
│   └── C026_R2.fq.gz
├── C027
│   ├── C027_R1.fq.gz
│   └── C027_R2.fq.gz
├── C028
│   ├── C028_R1.fq.gz
│   └── C028_R2.fq.gz
├── C029
│   ├── C029_R1.fq.gz
│   └── C029_R2.fq.gz
├── C030
│   ├── C030_R1.fq.gz
│   └── C030_R2.fq.gz
├── C031
│   ├── C031_R1.fq.gz
│   └── C031_R2.fq.gz
├── C032
│   ├── C032_R1.fq.gz
│   └── C032_R2.fq.gz
├── C033
│   ├── C033_R1.fq.gz
│   └── C033_R2.fq.gz
├── C034
│   ├── C034_R1.fq.gz
│   └── C034_R2.fq.gz
├── C035
│   ├── C035_R1.fq.gz
│   └── C035_R2.fq.gz
├── M1
│   ├── M1_R1.fq.gz
│   └── M1_R2.fq.gz
├── M2
│   ├── M2_R1.fq.gz
│   └── M2_R2.fq.gz
├── M3
│   ├── M3_R1.fq.gz
│   └── M3_R2.fq.gz
└── tree.txt

在result_bam文件夹中*.bam文件如下:

(base) root@wlab:~# conda activate snakemake
cd /mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# ll
total 8
drwxrwxrwx 1 wlab wlab 4096 Mar  5 20:55 ./
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:48 ../
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:48 C021/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:49 C022/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:49 C023/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:50 C024/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:51 C025/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:51 C026/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:52 C027/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:52 C028/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:53 C029/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:54 C030/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:54 C031/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:55 C032/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:56 C033/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:56 C034/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:57 C035/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:57 M1/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:58 M2/
drwxrwxrwx 1 wlab wlab 4096 Mar  5 10:59 M3/
-rwxrwxrwx 1 wlab wlab  637 Mar  5 20:53 hisat2_0305.sh*
drwxrwxrwx 1 wlab wlab 4096 Mar  5 22:50 result_bam/
-rwxrwxrwx 1 wlab wlab 1452 Mar  5 20:35 tree.txt*
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# cd result_bam/
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata/result_bam# ll
total 117633936
drwxrwxrwx 1 wlab wlab       4096 Mar  5 22:50 ./
drwxrwxrwx 1 wlab wlab       4096 Mar  5 20:55 ../
-rwxrwxrwx 1 wlab wlab 7663859512 Mar  5 23:00 C021.bam*
-rwxrwxrwx 1 wlab wlab 7882397946 Mar  5 21:17 C022.bam*
-rwxrwxrwx 1 wlab wlab 6379559184 Mar  5 21:19 C023.bam*
-rwxrwxrwx 1 wlab wlab 6819809013 Mar  5 21:20 C024.bam*
-rwxrwxrwx 1 wlab wlab 6577576520 Mar  5 21:37 C025.bam*
-rwxrwxrwx 1 wlab wlab 6500974785 Mar  5 21:40 C026.bam*
-rwxrwxrwx 1 wlab wlab 6416628503 Mar  5 21:40 C027.bam*
-rwxrwxrwx 1 wlab wlab 6077469669 Mar  5 22:02 C028.bam*
-rwxrwxrwx 1 wlab wlab 6648709068 Mar  5 22:09 C029.bam*
-rwxrwxrwx 1 wlab wlab 6522290778 Mar  5 22:09 C030.bam*
-rwxrwxrwx 1 wlab wlab 6676563366 Mar  5 22:14 C031.bam*
-rwxrwxrwx 1 wlab wlab 6545033472 Mar  5 22:15 C032.bam*
-rwxrwxrwx 1 wlab wlab 6374471577 Mar  5 22:16 C033.bam*
-rwxrwxrwx 1 wlab wlab 6895821251 Mar  5 22:29 C034.bam*
-rwxrwxrwx 1 wlab wlab 6324782655 Mar  5 22:40 C035.bam*
-rwxrwxrwx 1 wlab wlab 6500209105 Mar  5 22:40 M1.bam*
-rwxrwxrwx 1 wlab wlab 6862877164 Mar  5 22:41 M2.bam*
-rwxrwxrwx 1 wlab wlab 6788073312 Mar  5 22:41 M3.bam*
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata/result_bam#

2.进行bam文件排序和生成索引

sorted-bam

snakefile

import glob
# 自动获取 raw_fastq/ 目录下所有 .fastq 文件的前缀
# SAMPLES = [os.path.basename(f).split(".")[0] for f in glob.glob("raw_fastq/*.fastq")]
SAMPLES = [os.path.basename(f).split(".")[0] for f in glob.glob("result_bam/*.bam")]

rule all:
    input: expand("sorted_bam/{sample}.bam.bai", sample=SAMPLES)  # 动态生成目标文件


rule samtools_sort:
    input:
        "result_bam/{sample}.bam"
    output:
        "sorted_bam/{sample}.bam"
    shell:
        "samtools sort -@ 8 -T sorted_bam/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_bam/{sample}.bam"
    output:
        "sorted_bam/{sample}.bam.bai"
    shell:
        "samtools index {input}"

直接运行snakemake
先输出一个任务情况,空跑
snakmake -j 1 -np

运行同时运行4个任务 10:20
snakemake -j 4

第一个27min 10:47完成 -@ 8 j -4 , 此前fastq.gz 到.bam 单个15min, 同时多任务,20-40min
C023.bam (sorted_bam/C023.bam) 文件大概3GB
排序过程中大概生成4个临时文件, 00-03, 一个大概700MB-900MB
12:01跑完所有,18个文件 C021-C035, M1-M3,总耗时100min

snakemake -np -j 4 空跑内容如下:

[Thu Mar  6 10:20:43 2025]
rule all:
    input: sorted_bam/C021.bam.bai, sorted_bam/C022.bam.bai, sorted_bam/C023.bam.bai, sorted_bam/C024.bam.bai, sorted_bam/C025.bam.bai, sorted_bam/C026.bam.bai, sorted_bam/C027.bam.bai, sorted_bam/C028.bam.bai, sorted_bam/C029.bam.bai, sorted_bam/C030.bam.bai, sorted_bam/C031.bam.bai, sorted_bam/C032.bam.bai, sorted_bam/C033.bam.bai, sorted_bam/C034.bam.bai, sorted_bam/C035.bam.bai, sorted_bam/M1.bam.bai, sorted_bam/M2.bam.bai, sorted_bam/M3.bam.bai
    jobid: 0
    reason: Input files updated by another job: sorted_bam/C023.bam.bai, sorted_bam/C025.bam.bai, sorted_bam/C031.bam.bai, sorted_bam/C026.bam.bai, sorted_bam/C029.bam.bai, sorted_bam/M1.bam.bai, sorted_bam/C022.bam.bai, sorted_bam/C034.bam.bai, sorted_bam/M3.bam.bai, sorted_bam/C035.bam.bai, sorted_bam/C032.bam.bai, sorted_bam/C024.bam.bai, sorted_bam/C033.bam.bai, sorted_bam/C028.bam.bai, sorted_bam/M2.bam.bai, sorted_bam/C030.bam.bai, sorted_bam/C021.bam.bai, sorted_bam/C027.bam.bai
    resources: tmpdir=<TBD>

Job stats:
job               count
--------------  -------
all                   1
samtools_index       18
samtools_sort        18
total                37

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all, samtools_index
    output files have to be generated:
        samtools_index, samtools_sort

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

运行结束后,
sorted_bam 文件夹内容如下:


(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata/sorted_bam# ll
total 55301144
drwxrwxrwx 1 wlab wlab       4096 Mar  6 12:01 ./
drwxrwxrwx 1 wlab wlab       4096 Mar  6 17:51 ../
-rwxrwxrwx 1 wlab wlab 3609090038 Mar  6 10:48 C021.bam*
-rwxrwxrwx 1 wlab wlab    3010776 Mar  6 10:50 C021.bam.bai*
-rwxrwxrwx 1 wlab wlab 3724322297 Mar  6 11:37 C022.bam*
-rwxrwxrwx 1 wlab wlab    3091976 Mar  6 11:59 C022.bam.bai*
-rwxrwxrwx 1 wlab wlab 3043192690 Mar  6 10:39 C023.bam*
-rwxrwxrwx 1 wlab wlab    2834520 Mar  6 11:11 C023.bam.bai*
-rwxrwxrwx 1 wlab wlab 3184078469 Mar  6 11:18 C024.bam*
-rwxrwxrwx 1 wlab wlab    2898016 Mar  6 11:19 C024.bam.bai*
-rwxrwxrwx 1 wlab wlab 3116544539 Mar  6 10:38 C025.bam*
-rwxrwxrwx 1 wlab wlab    2839000 Mar  6 11:44 C025.bam.bai*
-rwxrwxrwx 1 wlab wlab 3008024145 Mar  6 11:42 C026.bam*
-rwxrwxrwx 1 wlab wlab    2822648 Mar  6 11:43 C026.bam.bai*
-rwxrwxrwx 1 wlab wlab 3022579759 Mar  6 11:28 C027.bam*
-rwxrwxrwx 1 wlab wlab    2791944 Mar  6 11:29 C027.bam.bai*
-rwxrwxrwx 1 wlab wlab 2825269558 Mar  6 11:11 C028.bam*
-rwxrwxrwx 1 wlab wlab    2761912 Mar  6 11:12 C028.bam.bai*
-rwxrwxrwx 1 wlab wlab 3224637392 Mar  6 10:44 C029.bam*
-rwxrwxrwx 1 wlab wlab    2888328 Mar  6 11:13 C029.bam.bai*
-rwxrwxrwx 1 wlab wlab 3049312737 Mar  6 11:58 C030.bam*
-rwxrwxrwx 1 wlab wlab    2835352 Mar  6 12:00 C030.bam.bai*
-rwxrwxrwx 1 wlab wlab 3152610212 Mar  6 11:03 C031.bam*
-rwxrwxrwx 1 wlab wlab    2865288 Mar  6 11:04 C031.bam.bai*
-rwxrwxrwx 1 wlab wlab 3045104160 Mar  6 11:01 C032.bam*
-rwxrwxrwx 1 wlab wlab    2841544 Mar  6 11:02 C032.bam.bai*
-rwxrwxrwx 1 wlab wlab 2999032769 Mar  6 10:44 C033.bam*
-rwxrwxrwx 1 wlab wlab    2829944 Mar  6 11:58 C033.bam.bai*
-rwxrwxrwx 1 wlab wlab 3218564839 Mar  6 11:57 C034.bam*
-rwxrwxrwx 1 wlab wlab    2900240 Mar  6 12:00 C034.bam.bai*
-rwxrwxrwx 1 wlab wlab 2961499528 Mar  6 11:25 C035.bam*
-rwxrwxrwx 1 wlab wlab    2816896 Mar  6 11:27 C035.bam.bai*
-rwxrwxrwx 1 wlab wlab 3014598137 Mar  6 11:09 M1.bam*
-rwxrwxrwx 1 wlab wlab    2846152 Mar  6 11:10 M1.bam.bai*
-rwxrwxrwx 1 wlab wlab 3206850477 Mar  6 11:35 M2.bam*
-rwxrwxrwx 1 wlab wlab    2914192 Mar  6 11:37 M2.bam.bai*
-rwxrwxrwx 1 wlab wlab 3171318943 Mar  6 12:00 M3.bam*
-rwxrwxrwx 1 wlab wlab    2877808 Mar  6 12:01 M3.bam.bai*
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata/sorted_bam#

3.进行基因计数

将 排序好的bam(已建立好索引)进行featureCounts
snakefile 和 config.yaml文件如下:

(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# cat snakefile
shell.executable("/usr/bin/bash")
configfile: "./config.yaml"

import glob

# 自动获取 raw_fastq/ 目录下所有 .fastq 文件的前缀
# SAMPLES = [os.path.basename(f).split(".")[0] for f in glob.glob("raw_fastq/*.fastq")]

SAMPLES = [os.path.basename(f).split(".")[0] for f in glob.glob("sorted_bam/*.bam")]

rule all:
    input: expand("final_counts/{sample}_counts.txt", sample=SAMPLES)  # 动态生成目标文件

rule featureCounts:
    input:
        expand("sorted_bam/{sample}.bam", sample=SAMPLES)
    output:
        "final_counts/{sample}_counts.txt",
        "final_counts/{sample}_counts.txt.summary"
    log:
        "log/{sample}_featurecounts.log"
    shell:
        "featureCounts {input} -a {config[gtf]} -o {output[0]} "
        "-T {config[threads]} -t exon -p -g gene_id 2> {log}"
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# ^C
(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata# cat config.yaml

gtf: "/mnt/d/big_data/RNAseq/hjy241226/gencode.vM36.chr_patch_hapl_scaff.annotation.gtf"
index: "/mnt/d/big_data/RNAseq/hjy241226/mm10/genome"
threads: 8
###featurecounts strand information 1 notrand 2 trand 0 unknown,depend on the library kit
# strand: 2

运行结束后,
final_counts 文件夹内容如下:

(/home/snakemake) root@wlab:/mnt/e/MKX/NGS/CA-NAdef_micelung/ANNO_XS01KF2024030054_PM-XS01KF2024030054-67_20250103/Rawdata/final_counts# ll
total 685028
drwxrwxrwx 1 wlab wlab     4096 Mar  8 11:14 ./
drwxrwxrwx 1 wlab wlab     4096 Mar  6 17:51 ../
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C021_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C021_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C022_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C022_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C023_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C023_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C024_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C024_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C025_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C025_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C026_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C026_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C027_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C027_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C028_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C028_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C029_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C029_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C030_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C030_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C031_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C031_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C032_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C032_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C033_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C033_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C034_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C034_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965177 Mar  6 19:10 C035_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 C035_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965175 Mar  6 19:10 M1_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 M1_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965175 Mar  6 19:10 M2_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 M2_counts.txt.summary*
-rwxrwxrwx 1 wlab wlab 38965175 Mar  6 19:10 M3_counts.txt*
-rwxrwxrwx 1 wlab wlab     1743 Mar  6 19:10 M3_counts.txt.summary*

注意事项:
要加上-p 代表双端测序数据 否则报错
要安装subread,里面包含featureCounts函数
conda install subread
snakemake 运行成功
再试回snakefile---17:51-19:10---
总耗时 80min
(测试单个bam文件生成featureCounts只要1min)

4. Deseq分析绘图

DESeq分析小鼠组织RNAseq数据

©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容