教程在:肿瘤外显子数据处理系列教程(二)质控与去接头 (qq.com)
安装软件
这里要用到fastqc和python,先检查自己有没有
fastqc -h
如果没有就要按照教程进行安装,如下:
## fastqc使用多线程要用--threads
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip
unizp fastqc_v0.11.8.zip
ln -s ~/tools/FastQC/fastqc ~/tools/bin/fastqc
mkdir -p 2.qc/{pre, post}
nohup fastqc --outdir ../2.qc/pre --threads 16 *.fastq.gz > ../2.qc/pre/fastqc.log 2>&1 &
## 需要先安装python
## sudo apt install libffi-dev
cd
cd tools/
mkdir python
wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz
tar zxvf Python-3.7.3.tgz
cd Python-3.7.3
./configure --prefix="/home/llwu/tools/python/"
make
make install
cd ../python/bin
ls * | while read id
do
ln -s ~/tools/python/bin/${id} ~/tools/bin/${id}
done
这里的链接可能也是过时的,需要去找新的进行下载
我检查了一下我的python的版本是
(/home/jiarongf/my-envs/wes) 10:11:01 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra
$
python --version
Python 2.7.15
不知道后面是否会有影响,结果发现自己的安装实在home下的环境,所以赶紧去更改了一下,具体可以看我发布的一篇更改虚拟环境位置的文章,
更改之后就python就是3.8的了
(/home/jiarongf/my-envs/wes) 10:36:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra
$
python --version
Python 3.8.8
更改了之后记得重新进行wes的环境
source activate wes
(wes) 11:18:00 jiarongf@172.16.10.223:/data1/jiarongf
还会用到MultiQC,进行安装,也是直接安装在run文件夹下
下面是安装的脚本,可以分开运行
curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
unzip master.zip
cd MultiQC-master
python setup.py install
下载了之后才发现这个可以用pip或者conda安装
Installation
You can install MultiQC from PyPI
using pip
as follows:
bash
pip install multiqc
Alternatively, you can install using Conda
from the bioconda channel:
conda install -c bioconda multiqc
If you would like the development version instead, the command is:
pip install --upgrade --force-reinstall git+https://github.com/ewels/MultiQC.git
但是此处我已经下载了,所以直接安装就好了
python setup.py install
检测安装是否成功
multiqc -h
去接头前质控
首先需要把之前生成的fastq的文件放到一个目录下,定义为raw_data
(wes) 15:48:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_qc_report.sh
(wes) 15:49:31 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_report.sh
/// MultiQC 🔍 | v1.13.dev0
| multiqc | Report title: QC REPORT OF SRP070662
| multiqc | Search path : /data1/jiarongf/learning/cancer-WES/0.sra/raw_data
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 60/60
| multiqc | No analysis results found. Cleaning up..
| multiqc | MultiQC complete
(wes) 15:49:51 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_qc_report.sh
multiqc ../0.sra/raw_data/ -n pre_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc
(wes) 15:52:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_report.sh > ../log/pre_qc_report.log 2>&1
但是感觉没有输出结果
这里突然感觉应该先fastqc,然后再去multi,找到一个好的教程就是从下载sra数据,解压为fastq,在做fastqc在multi,然后合成一个报告,还有报告结果的解读都很不错
MultiQC使用 | 码农家园 (codenong.com)
fastqc
浅试一个
(wes) 15:58:02 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
fastqc /data1/jiarongf/learning/cancer-WES/0.sra/raw_data/case3_techrep_2_WES_2.fastq.gz
Started analysis of case3_techrep_2_WES_2.fastq.gz
Approx 5% complete for case3_techrep_2_WES_2.fastq.gz
Approx 10% complete for case3_techrep_2_WES_2.fastq.gz
Approx 15% complete for case3_techrep_2_WES_2.fastq.gz
Approx 20% complete for case3_techrep_2_WES_2.fastq.gz
Approx 25% complete for case3_techrep_2_WES_2.fastq.gz
Approx 30% complete for case3_techrep_2_WES_2.fastq.gz
Approx 35% complete for case3_techrep_2_WES_2.fastq.gz
Approx 40% complete for case3_techrep_2_WES_2.fastq.gz
Approx 45% complete for case3_techrep_2_WES_2.fastq.gz
Approx 50% complete for case3_techrep_2_WES_2.fastq.gz
Approx 55% complete for case3_techrep_2_WES_2.fastq.gz
Approx 60% complete for case3_techrep_2_WES_2.fastq.gz
Approx 65% complete for case3_techrep_2_WES_2.fastq.gz
Approx 70% complete for case3_techrep_2_WES_2.fastq.gz
Approx 75% complete for case3_techrep_2_WES_2.fastq.gz
Approx 80% complete for case3_techrep_2_WES_2.fastq.gz
Approx 85% complete for case3_techrep_2_WES_2.fastq.gz
Approx 90% complete for case3_techrep_2_WES_2.fastq.gz
Approx 95% complete for case3_techrep_2_WES_2.fastq.gz
Analysis complete for case3_techrep_2_WES_2.fastq.gz
(wes) 16:05:53 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
时间太久了,建议后台跑,再有就是这个成功的,
跑完以后再原来的目录生成了
这只是一个的,所以需要循环一下,run目录下写个脚本
前面的脚本
pre_qc_report.sh
也需要改个名字
(wes) 16:08:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
mv pre_qc_report.sh pre_qc_multiqc.sh
上述做错了,重新搞
去接头前质控
(wes) 16:33:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
cd /data1/jiarongf/learning/cancer-WES
(wes) 16:31:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
mkdir -p 2.qc/{pre,post}
(wes) 16:33:01 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
l
0.sra/ 2.qc/ data/ log/ run/
(wes) 16:33:02 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
ls 2.qc/
post pre
质控
nohup fastqc --outdir ../2.qc/pre --threads 16 *.fastq.gz > ../2.qc/pre/fastqc.log 2>&1 &
(wes) 16:35:55 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_fastqc.sh
(wes) 16:41:22 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_fastqc.sh
nohup fastqc --outdir ../2.qc/pre --threads 16 ../0.sra/raw_data/*.fastq.gz > ../log/pre.fastqc.log 2>&1 &
(wes) 16:41:09 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_fastqc.sh
(wes) 16:41:30 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
直接挂到后台去,
一共有60个,
16个16个的跑,感觉也要跑很久,可以先做去接头
跑完啦
/data1/jiarongf/learning/cancer-WES/2.qc/pre
可以multi了
10:24:45 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_qc_multiqc.sh
10:26:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/pre -n pre_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/pre/
10:26:42 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_multiqc.sh > ../log/pre_qc_multiqc.sh.log 2>&1
10:28:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
看结果
1.general statistical
2.fastqc
2.1 Sequence Counts
黑色的是重复序列,蓝色的是有用的序列(reads)
可以看到重复序列的比例还是小的
2.2 Sequence Quality Histograms
每一条线是一个样本,应该打分都在绿色区域就会好一点
绿色区间——质量很好,橙色区间——质量合理,红色区间——质量不好
这60个都是绿色的线,代表测序质量挺好的
2.3 Per Sequence Quality Scores
绿色区间——质量很好、橙色区间——质量合理、红色区间——质量不好
2.4 Per base Sequence Content
结果显示3序列都warn,如果是红色的说明每个位置每种碱基出现的概率差别很大,可能有过表达序列的污染。
2.5 Per Sequence GC Content
以下是一个错误的例子
这里结果显示四条序列都被报错,从形状上来看曲线和正态曲线相差甚远,可能是由于文库的污染或是部分reads构成的子集有偏差造成的
2.6 Per Base N Content 每条reads各位置N碱基含量比例
说明测序仪器能辨别这60个样本中每条reads的每个位置的碱基
2.7 Sequence Length Distribution 序列长度分布
对于这几个样本,每次测序仪测出来的长度主要都在76bp
2.8 Sequence Duplication Levels 每个序列的相对重复水平
2.9 Overrepresented sequences 文库中过表达序列的比例
这些序列中过表达的序列的比例都远远超过1%,如果有的序列中过表达的序列超过50%,这种情况的出现,不是这种转录本巨量表达,就是样品被污染。
2.10 Adapter Content 接头含量
这是没有接头的意思????不太清楚了,一会和去接头以后的再比较以下,看看有什么差别