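I. Introduction to MultiQC
Advances in second-generation sequencing have made high-throughput sequencing data routine. Assessing quality at every analysis step is what makes the downstream results trustworthy. Many bioinformatics tools already handle sequencing QC: FastQC checks raw sequencing reads, RSeQC checks RNA-Seq data, and Qualimap reports sequencing depth and coverage, among others. With a large batch of samples, however, going through each sample's QC result one by one is slow and tedious. MultiQC solves this problem: it merges the many reports produced by FastQC into a single report, which is easier to browse and to compare across samples.
II. Installing MultiQC
MultiQC can be installed with conda.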
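(base) yu@yu-virtual-machine:~$ conda --version # check whether conda is installed
conda 4.12.0 # the conda version; if the shell reports that the conda command is not found, install conda first
Now MultiQC can be installed.
To avoid polluting the base environment while installing and using software, it is better to create a separate virtual environment for each tool.
Setting up a virtual environment for MultiQC
1. Search for the available Python versions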
(base) yu@yu-virtual-machine:~$ conda search -f python
Loading channels: done
# Name Version Build Channel
python 2.7.13 hac47a24_15 pkgs/main
python 2.7.14 h1571d57_29 pkgs/main
python 2.7.15 h1571d57_0 pkgs/main
python 2.7.16 h8b3fad2_1 pkgs/main
... # many Python versions omitted here
python 3.9.13 haa1d7c7_1 pkgs/main
python 3.10.0 h12debd9_0 pkgs/main
python 3.10.3 h12debd9_5 pkgs/main
python 3.10.4 h12debd9_0 pkgs/main
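At the time I believed MultiQC was written as Python 2 scripts, so I set up a Python 2 environment for it. This assumption turned out to be wrong: as the pip error further down shows, current MultiQC releases need Python 3.
2. Create the virtual environment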
(base) yu@yu-virtual-machine:~$ conda create -n py2env python=2.7
# the environment is named py2env and uses Python 2.7
3. Activate the virtual environment
(base) yu@yu-virtual-machine:~$ conda activate py2env
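(py2env) yu@yu-virtual-machine:~$ # the name in the leftmost parentheses is the active virtual environment
4. Configure the Tsinghua mirror
Configure the mirror by following the help page of the Tsinghua mirror site. Note that ~/.condarc applies to every conda environment, not only to py2env.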
(py2env) yu@yu-virtual-machine:~$ conda config --set show_channel_urls yes
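(py2env) yu@yu-virtual-machine:~$ nano .condarc # paste the .condarc content from the mirror site's help page into this file
(py2env) yu@yu-virtual-machine:~$ conda clean -i # clear the index cache so that the mirror's index is used
(py2env) yu@yu-virtual-machine:~$ conda config --show # check that the mirror is configured
Try to install multiqc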
(py2env) yu@yu-virtual-machine:~$ conda install multiqc
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- multiqc
Current channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/linux-64
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/noarch
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/linux-64
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/noarch
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2/linux-64
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
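The installation fails. The channel list in the error contains only pkgs/main, pkgs/r and pkgs/msys2; multiqc is distributed through the bioconda channel, which is not among them.
First attempted fix: delete part of the channels configuration file (mainly the line "- defaults").
This did not help; the same error appeared.
Second attempted fix: specify the channel directly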
(py2env) yu@yu-virtual-machine:~$ conda install multiqc -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
This attempt also failed, with the following error:
(py2env) yu@yu-virtual-machine:~$ conda install multiqc -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
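Switch to pip for the installation
(py2env) yu@yu-virtual-machine:~$ pip install multiqc
This fails with a message that the Python version is too old, so I upgraded the Python in the virtual environment to 3.5:
(py2env) yu@yu-virtual-machine:~$ conda install python=3.5
Installing multiqc again failed halfway: the default pip was too old to download some of the packages. So I upgraded pip to the newest version it could install, and updated Python along the way:
(py2env) yu@yu-virtual-machine:~$ pip install --upgrade pip # upgrade pip
(py2env) yu@yu-virtual-machine:~$ pip --version # check that pip was upgraded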
pip 20.3.4 from /home/yu/anaconda3/envs/py2env/lib/python3.5/site-packages/pip (python 3.5)
Output confirming that multiqc is installed:
(py2env) yu@yu-virtual-machine:~$ multiqc --version
multiqc, version 1.13
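In hindsight, a shorter route is probably to create a fresh environment and let conda pull multiqc from bioconda, with conda-forge supplying its dependencies. The sketch below was not tested on the machine above. The function name create_multiqc_env and the default environment name multiqc_env are made up for illustration, and the channel names can be replaced by the corresponding Tsinghua mirror URLs.

```shell
# Hypothetical helper: create a dedicated environment and install multiqc in one step.
# Assumes conda is on PATH and that the conda-forge and bioconda channels are reachable.
create_multiqc_env() {
    env_name="${1:-multiqc_env}"   # environment name (illustrative default)
    conda create -y -n "$env_name" -c conda-forge -c bioconda multiqc
}
```

After calling create_multiqc_env, running conda activate multiqc_env and then multiqc --version should print the installed version.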
III. Running MultiQC
Download the SRR data
Search the SRA database on the NCBI website for the SRR runs to use.
1. Prefetch the SRR data
(base) yu@yu-virtual-machine:~/biodata$ prefetch SRR11192680 # download
This sra toolkit installation has not been configured. # the SRA Toolkit is installed but not yet configured
Before continuing, please run: vdb-config --interactive
For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/
(base) yu@yu-virtual-machine:~/biodata$ vdb-config --interactive # run the command the message asks for
Run prefetch again
(base) yu@yu-virtual-machine:~/biodata$ prefetch SRR11192680
2022-09-24T11:27:44 prefetch.2.11.3: 1) 'SRR11192680' was downloaded successfully
(base) yu@yu-virtual-machine:~/biodata$ prefetch SRR11192681
2022-09-24T11:38:16 prefetch.2.11.3: 1) 'SRR11192681' was downloaded successfully
Check the prefetched data:
(base) yu@yu-virtual-machine:~/biodata$ ll
drwxrwxr-x 2 yu yu 4096 9月 24 19:27 SRR11192680/
drwxrwxr-x 2 yu yu 4096 9月 24 19:38 SRR11192681/
2. Convert the SRR data to FASTQ
(py2env) yu@yu-virtual-machine:~/biodata$ fastq-dump --gzip --split-files SRR11192680
(py2env) yu@yu-virtual-machine:~/biodata$ fastq-dump --gzip --split-files SRR11192681
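# --gzip compresses the output with gzip (to save disk space); --split-files writes the two reads of each pair to separate _1 and _2 files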
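With more than a couple of runs, the prefetch and fastq-dump steps can be wrapped in a loop. This is a minimal sketch, assuming both tools are on PATH and already configured. The function name fetch_runs is made up for illustration; the flags are the ones used above.

```shell
# Hypothetical helper: prefetch each accession, then convert it to gzipped FASTQ.
# Stops at the first accession that fails, so a failed download is never converted.
fetch_runs() {
    for acc in "$@"; do
        prefetch "$acc" || return 1
        fastq-dump --gzip --split-files "$acc" || return 1
    done
}
```

For the two runs in this post the call would be: fetch_runs SRR11192680 SRR11192681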
Check the result:
(py2env) yu@yu-virtual-machine:~/biodata$ ll
-rw-rw-r-- 1 yu yu 4491094 9月 24 20:12 SRR11192680_1.fastq.gz
-rw-rw-r-- 1 yu yu 4941230 9月 24 20:12 SRR11192680_2.fastq.gz
-rw-rw-r-- 1 yu yu 935422 9月 24 20:12 SRR11192681_1.fastq.gz
-rw-rw-r-- 1 yu yu 1035421 9月 24 20:12 SRR11192681_2.fastq.gz
Run QC on the data with fastqc
(py2env) yu@yu-virtual-machine:~/biodata$ fastqc SRR11192680_1.fastq.gz
(py2env) yu@yu-virtual-machine:~/biodata$ fastqc SRR11192680_2.fastq.gz
(py2env) yu@yu-virtual-machine:~/biodata$ fastqc SRR11192681_1.fastq.gz
(py2env) yu@yu-virtual-machine:~/biodata$ fastqc SRR11192681_2.fastq.gz
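fastqc accepts several input files in one call, so the four commands above can be collapsed into one. This is a minimal sketch, assuming fastqc is on PATH. The function name run_fastqc_all and the default output directory fastqc_out are made up for illustration.

```shell
# Hypothetical helper: run FastQC once over every *.fastq.gz file in a directory.
run_fastqc_all() {
    in_dir="${1:-.}"             # directory holding the FASTQ files
    out_dir="${2:-fastqc_out}"   # where FastQC writes its .html and .zip reports
    # Create the output directory up front, since fastqc -o expects it to exist.
    mkdir -p "$out_dir" || return 1
    fastqc -o "$out_dir" "$in_dir"/*.fastq.gz
}
```

Called with no arguments in ~/biodata, this processes all four files and puts the reports under fastqc_out.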
Summarize the QC results with multiqc
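The multiqc command itself is not recorded in these notes. A typical invocation points multiqc at the directory that holds the FastQC outputs; it scans that directory and writes multiqc_report.html. This is a minimal sketch, assuming multiqc is on PATH. The function name summarize_qc is made up for illustration.

```shell
# Hypothetical helper: build one MultiQC report from the QC results in a directory.
summarize_qc() {
    in_dir="${1:-.}"    # directory scanned for FastQC results (default: current directory)
    out_dir="${2:-.}"   # directory that receives multiqc_report.html
    multiqc "$in_dir" -o "$out_dir"
}
```

Running summarize_qc with no arguments in ~/biodata amounts to multiqc . -o . and leaves multiqc_report.html in the same directory.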
View the multiqc report
On a virtual machine with a desktop, the report can be opened directly in the bundled Firefox browser
(py2env) yu@yu-virtual-machine:~/biodata$ firefox multiqc_report.html
When logged in remotely through MobaXterm, the report can be downloaded from the sidebar instead
Interpreting the results:
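1. General Statistics: basic statistics for all samples
%Dups: proportion of duplicate reads
%GC: GC bases as a percentage of all bases
M Seqs: total number of sequences (in millions)
2. Sequence Counts: number of sequences
Black is the number of duplicate reads; blue is the number of unique reads
3. Sequence Quality Histograms: mean sequencing quality at each base position
The x-axis is the base position; the y-axis is the quality score
Green zone: good quality; orange zone: acceptable quality; red zone: poor quality
4. Per Sequence Quality Scores: number of reads at each mean quality score
The x-axis is the mean sequence quality score; the y-axis is the number of reads
Green zone: good quality; orange zone: acceptable quality; red zone: poor quality
5. Per Base Sequence Content: proportion of each base at each position
6. Per Sequence GC Content: distribution of the mean GC content of reads
The x-axis is the GC percentage; the y-axis is the count
For a normal sample the GC content curve approximates a normal distribution. A distorted shape often points to a contaminated library or a biased subset of reads; a curve that is close to normal but shifted away from the theoretical distribution may indicate a systematic bias
7. Per Base N Content: proportion of N bases at each position
An N means the sequencer could not determine which base is at that position
8. Sequence Length Distribution: distribution of sequence lengths
9. Sequence Duplication Levels: relative level of sequence duplication
10. Overrepresented sequences: proportion of overrepresented sequences in the library
The x-axis is the proportion of overrepresented sequences
If overrepresented sequences make up far more than 1%, either some transcript is expressed at a very high level or the sample is contaminated
11. Adapter Content: adapter content
12. Summary of the 11 quality checks above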
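To sanity-check the M Seqs and %GC columns for a single file, the same quantities can be computed directly from the FASTQ. This is a minimal sketch, assuming a well-formed gzipped FASTQ with four lines per record. The function name fastq_stats is made up for illustration, and the figures may differ slightly from the report because of rounding.

```shell
# Hypothetical helper: count reads and bases and compute overall GC% for one .fastq.gz file.
fastq_stats() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {                     # line 2 of each 4-line record is the sequence
            reads++
            bases += length($0)           # measure the length before gsub edits the line
            gc += gsub(/[GCgc]/, "")      # gsub returns how many G/C characters it removed
        }
        END {
            if (bases > 0)
                printf "reads=%d bases=%d gc_percent=%.1f\n", reads, bases, 100 * gc / bases
        }'
}
```

For example, fastq_stats SRR11192680_1.fastq.gz prints one line; dividing the reads value by one million gives the number to compare with M Seqs.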
PS:
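Common virtual environment operations
1. List the existing virtual environments
conda info -e
conda env list
2. Leave the current virtual environment
conda deactivate
3. Delete a virtual environment
conda remove -n env_name --all
4. Rename a virtual environment (done by cloning)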
conda create --name new_name --clone old_name
conda remove --name old_name --all
References:
https://blog.csdn.net/qazplm12_3/article/details/84550515
https://www.cnblogs.com/yuehouse/p/10239195.html
https://zhuanlan.zhihu.com/p/508480163
https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/
https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247489078&idx=2&sn=5ec77cb921bfc2ece50cb170b7c33316&chksm=9b48568dac3fdf9b425c03596bc8b9585270f864eacec2cc92d8ee088439d77c8b024aa53f65&scene=21#wechat_redirect
http://events.jianshu.io/p/43680bdd42ae
https://www.jianshu.com/p/94bbeb800609