Q1:找到文章并下载测序数据
- 文章:
- 2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing 使用成熟的
单细胞转录组( Smart-seq2 )
手段探索了癌相关的成纤维细胞 CAFs的功能和空间异质性。 -
文章中表明数据存储地址
上GEO官网,搜索id号---下载id列表(由于内存限制,我只下载了48 49 50 51) ---将列表保存至linux系统下的text.txt文件中
- 2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing 使用成熟的
- 下载SRA文件
prefetch SRR3589948 #先试一下,跑完命令之后会产生名为ncbi的文件夹
cat text.txt |while read id ;do echo $id;done
cat text.txt |while read id ;do prefetch $id ;done
#下载来的基因id的列表保存为text.txt,可以将此命令保存为.sh 文件,加上nohup .sh & 后台下载,下载完成后,主目录下会生成ncbi文件夹,里面包含了SRA文件
Q4: 任意挑选4个样本SRA格式转换为fastq文件格式
fastq-dump --gzip --split-3 -O ~/rna-seq/ /trainee1/vip77/ncbi/public/sra/SRR3589948.sra
fastq-dump --gzip --split-3 -O ~/rna-seq/ /trainee1/vip77/ncbi/public/sra/SRR3589*#批量转换,下载了好长时间,感觉要两个小时的样子
Q5 质控
- 简单质控
fastqc -t 2 SRR3589948_1.fastq.gz #大概要3分钟的样子,生成.html文件
- 批量质控
nohup fastqc -t 2 *.fastq.gz &
ls *gz | xargs fastqc -t 2 #后台批量质控
multiqc ./*zip # 整合所有qc结果 #产生multiqc开头的文件,html文件可以ftp传输至window打开看
Q6 过滤数据(去除低质量数据及接头)
source activate rna
conda install -y trim-galore
#简单运行,保存为.sh文件,虽然简单,但是还是运行了月25分钟
chmod +x .sh文件名
bin_trim_galore=trim_galore
dir='/trainee1/vip77/rna-seq/clean'
fq1='SRR3589948_1.fastq.gz'
fq2='SRR3589948_2.fastq.gz'
$bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $d$
#运行之后生成文件,
ls
SRR3589948_1.fastq.gz_trimming_report.txt
SRR3589948_1_val_1.fq.gz
SRR3589948_2.fastq.gz_trimming_report.txt
SRR3589948_2_val_2.fq.gz
#批量处理,并后台运行
#找出SRR3589948_1.fastq.gz 文件所在位置路径,并复制路径
ls /trainee1/vip77/rna-seq/*1.fastq.gz >1
ls /trainee1/vip77/rna-seq/*2.fastq.gz >2
paste 1 2
paste 1 2 > config
#之后重新写过滤数据的脚本
bin_trim_galore=trim_galore
dir='/trainee1/vip77/rna-seq/clean' #文件输出地址
cat $config |while read id
do
arr=(${id})
fq1=${arr[0]}
fq2=${arr[1]}
nohup $bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir $fq1 $fq2 &
done
过滤完之后,可以再进行一次fastqc,对比前后,如果结果不理想,需要更改参数,继续过滤
Q7 比对 + counts
- 截取数据
ls *gz | while read id ;do(echo $(basename $id));done #取出basename
mkdir test3
cd test3
ls /trainee1/vip77/rna-seq/*gz | while read id ;do (zcat $id |head - 1000> $(basename $id) );done # 我用的文件是没有经过过滤的文件,过滤时间太长了,等不及想先试试
ls -hl
total 0
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589948_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589948_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589949_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589949_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589950_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589950_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589951_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 0 May 8 21:02 SRR3589951_2.fastq.gz
# >前面少加了空格,其实这些文件并非压缩文件
ls /trainee1/vip77/rna-seq/*gz | while read id ;do (zcat $id |head -1000 > $(basename $id) );done # 我用的文件是没有经过过滤的文件,过滤时间太长了,等不及想先试试
ls -hl
total 480K
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589948_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589948_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589949_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589949_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589950_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589950_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589951_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:10 SRR3589951_2.fastq.gz
#其实这些文件并非压缩文件
- 比对
- 转录组数据一般使用软件:hisat2 ,subjunc,star
基因组比对一般使用:BWA,bowtie2 - 构建索引(已经构建好索引,时间太长)
索引文件位置:/teach/database/index
/teach/database/index/hisat
/teach/database/index/bwa
/teach/database/index/salmon
参考基因组位置:/teach/database/genome
注释文件位置 :/teach/database/gtf
gencode.v25.annotation.gtf.gz
gencode.v29.annotation.gtf.gz
- 转录组数据一般使用软件:hisat2 ,subjunc,star
- 简单比对
- hisat比对
-p 分配线程,-x 后接索引文件
- hisat比对
hisat2 -p 2 -x /teach/database/index/hisat/hg38/genome -1 SRR3589948_1.fastq.gz -2 SRR3589948_2.fastq.gz
gzip: SRR3589948_1.fastq.gz: not in gzip format
gzip: SRR3589948_2.fastq.gz: not in gzip format# 有可能是没有识别.gz文件非压缩文件,所以我们换下文件名
ls /trainee1/vip77/rna-seq/*gz | while read id ;do (zcat $id |head -1000 > $(basename $id ".gz" ) );done #可以把文件名后.gz去除
ls -hl
total 960K
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589948_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589948_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589948_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589948_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589949_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589949_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589949_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589949_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589950_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589950_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589950_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589950_2.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589951_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589951_1.fastq.gz
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589951_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:27 SRR3589951_2.fastq.gz
hisat2 -p 2 -x /teach/database/index/hisat/hg38/genome -1 SRR3589948_1.fastq -2 SRR3589948_2.fastq
250 reads; of these:
250 (100.00%) were paired; of these:
24 (9.60%) aligned concordantly 0 times
24 (9.60%) aligned concordantly exactly 1 time
202 (80.80%) aligned concordantly >1 times
----
24 pairs aligned concordantly 0 times; of these:
0 (0.00%) aligned discordantly 1 time
----
24 pairs aligned 0 times concordantly or discordantly; of these:
48 mates make up the pairs; of these:
27 (56.25%) aligned 0 times
4 (8.33%) aligned exactly 1 time
17 (35.42%) aligned >1 times
94.60% overall alignment rate
hisat2 -p 2 -x /teach/database/index/hisat/hg38/genome -1 SRR3589948_1.fastq -2 SRR3589948_2.fastq -S tem.sam # 结果输出为tem.sam
- subjunc比对 # 没有找到参考基因组,先略过
- Bwa比对
source activate rna
conda install bwa
bwa
bwa index
bwa mem
bwa mem -t 2 -M /teach/database/index/bwa/hg38/hg38 SRR3589948_1.fastq SRR3589948_2.fastq >bwa.sam
- 批量比对
for i in {48..51};do (bwa mem -t 2 -M /teach/database/index/bwa/hg38/hg38 SRR35899${i}_1.fastq SRR35899${i}_2.fastq > SRR35899${i}.bwa.sam );done
for i in {48..51};do(hisat2 -p 2 -x /teach/database/index/hisat/hg38/genome -1 SRR35899${i}_1.fastq -2 SRR35899${i}_2.fastq> SRR35899${i}.hisat.sam);done
#生成新文件 ,all.id.txt 为表达矩阵
-rw-rw-r-- 1 vip77 vip77 33M May 10 17:05 all.id.txt
-rw-rw-r-- 1 vip77 vip77 451 May 10 17:05 all.id.txt.summary
-rw-rw-r-- 1 vip77 vip77 7.7K May 10 17:05 counts.id.bwa.log
-rw-rw-r-- 1 vip77 vip77 8.9K May 10 17:04 counts.id.hisat.log
- count
featureCounts -T 2 -p -t exon -g gene_id -a /teach/database/gtf/gencode.v29.annotation.gtf.gz -o all.id.txt *bwa.bam 1>counts.id.bwa.log 2>&1 &
featureCounts -T 2 -p -t exon -g gene_id -a /teach/database/gtf/gencode.v29.annotation.gtf.gz -o all.id.txt *hisat.bam 1>counts.id.hisat.log 2>&1 &
- sam文件转bam文件并排序--bam文件建立索引--
索引构建成功后 + 到IGV中查看
+ 对比对好的,再进行qc :可以用samtools的命令
ls *hisat.sam|while read id ;do (samtools sort -O bam -@ 5 -o $(basename ${id} "hisat.sam")hisat.bam ${id});done
ls *bwa.sam|while read id ;do (samtools sort -O bam -@ 5 -o $(basename ${id} "bwa.sam")bwa.bam ${id});done # @表示线程数
rm *.sam
ls *hisat.bam |xargs -i samtools index {}
ls *bwa.bam |xargs -i samtools index {}
# 建立索引之后会生成bam.bai 后缀的文件,
ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagsta );done
mkdir flagstat
mv *flagstat flagstat/
#在对建立好索引的文件统计,里面包括了一些统计信息
cd flagstat
multiqc ./ #结果出来挺别扭的,也许跟数据有关
#统计*flagstat中信息,包括比对后的浓缩的信息,每个样本比对信息等转换成excel表格(最好用shell 脚本提取的方法),
之后,就可以用all.id.txt 在R做表达差异分析
补充
找文件id名
ls clean/*gz | while read id ;do (echo $id );done
ls clean/*gz | while read id ;do (echo $(basename $id) );done
ls clean/*gz | while read id ;do (zcat $id |head - 1000>$(basename $id) );done
删减文件名:
ls
SRR3589948_1.fastq.gz SRR3589949_2.fastq.gz SRR3589951_1.fastq.gz
SRR3589948_2.fastq.gz SRR3589950_1.fastq.gz SRR3589951_2.fastq.gz
SRR3589949_1.fastq.gz SRR3589950_2.fastq.gz
ls *gz | while read id ;do (echo ${id%%.*});done
SRR3589948_1
SRR3589948_2
SRR3589949_1
SRR3589949_2
SRR3589950_1
SRR3589950_2
SRR3589951_1
SRR3589951_2
没有弄懂的问题:
question 1:
使用生信技能树批量比对代码,出现错误
nano 批量比对2.sh
ls *fastq|cut -d"_" -f 1 |sort -u |while read id;do
ls -lh ${id}_1.fastq ${id}_2.fastq;done
hisat2 -p 2 -x /teach/database/index/hisat/hg38/genome -1 ${id}_1.fastq -2 ${id}_2.fastq -S ${id}.hisat2.sam
./批量比对2.sh
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589948_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589948_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589949_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589949_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589950_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589950_2.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589951_1.fastq
-rw-rw-r-- 1 vip77 vip77 58K May 8 21:29 SRR3589951_2.fastq
Warning: Could not open read file "_1.fastq" for reading; skipping...
Error: No input read files were valid
(ERR): hisat2-align exited with value 1
question 2
bam文件why建立索引
参考:https://github.com/jmzeng1314/GEO
https://www.jianshu.com/p/a84cd44bac67
http://www.bio-info-trainee.com/3920.html
https://www.bilibili.com/video/av28453557/?p=10
https://www.jianshu.com/p/22f047c26fa8
转录组定量