教程在:肿瘤外显子数据处理系列教程(二)质控与去接头 (qq.com)
去接头
1.安装TrimGalore
curl https://codeload.github.com/FelixKrueger/TrimGalore/zip/0.6.1 TrimGalore.zip
unzip TrimGalore.zip
ln -s ~/tools/TrimGalore-0.6.1/trim_galore ~/tools/bin/trim_galore
在查找这个软件的时候发现了一个博主介绍的很好,此处复制一点自己觉得有用的
Trim Galore是对FastQC和Cutadapt的包装。适用于所有高通量测序,包括RRBS(Reduced Representation Bisulfite-Seq ), Illumina、Nextera 和smallRNA测序平台的双端和单端数据。主要功能包括两步:
第一步首先去除低质量碱基,然后去除3' 末端的adapter, 如果没有指定具体的adapter,程序会自动检测前1million的序列,然后对比前12-13bp的序列是否符合以下类型的adapter:
- Illumina: AGATCGGAAGAGC
- Small RNA: TGGAATTCTCGG
- Nextera: CTGTCTCTTATA
但是这个作者并没有用conda安装成功
(wes) 16:45:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
conda install trim-galore
Collecting package metadata (current_repodata.json): -
fontconfig conda-forge/linux-64::fontconfig-2.14.0-h8e229c2_0
freetype conda-forge/linux-64::freetype-2.10.4-h0708190_1
libnsl conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
libpng conda-forge/linux-64::libpng-1.6.37-h21135ba_2
libuuid conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
openjdk conda-forge/linux-64::openjdk-8.0.312-h7f98852_0
perl conda-forge/linux-64::perl-5.32.1-2_h7f98852_perl5
pigz conda-forge/linux-64::pigz-2.6-h27826a3_0
trim-galore bioconda/noarch::trim-galore-0.6.7-hdfd78af_0
xopen bioconda/noarch::xopen-0.7.3-py_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
pigz-2.6 | 87 KB | ##################################### | 100%
fontconfig-2.14.0 | 305 KB | ##################################### | 100%
xopen-0.7.3 | 11 KB | ##################################### | 100%
fastqc-0.11.9 | 9.7 MB | ##################################### | 100%
perl-5.32.1 | 14.4 MB | ##################################### | 100%
freetype-2.10.4 | 890 KB | ##################################### | 100%
libnsl-2.0.0 | 31 KB | ##################################### | 100%
libpng-1.6.37 | 306 KB | ##################################### | 100%
cutadapt-1.18 | 199 KB | ##################################### | 100%
expat-2.4.8 | 187 KB | ##################################### | 100%
openjdk-8.0.312 | 97.6 MB | ##################################### | 100%
bz2file-0.98 | 9 KB | ##################################### | 100%
libuuid-2.32.1 | 28 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
可以看到依赖的环境很多,要一个一个下载
(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
trim_galore --help
···
within the other read).
NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.
--retain_unpaired If only one of the two paired-end reads became too short, the longer
read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
output files. The length cutoff for unpaired single-end reads is
governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.
-r1/--length_1 <INT> Unpaired single-end read length cutoff needed for read 1 to be written to
'.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.
-r2/--length_2 <INT> Unpaired single-end read length cutoff needed for read 2 to be written to
'.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.
Last modified on 07 October 2020.
安装成功
(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --help
cutadapt version 1.18
Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.
Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.
Citation:
Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200
Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--debug Print debugging information.
···
但是这个版本不得行
(wes) 17:28:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --version
1.18
作者那时候2019年都2.6了,现在会更新
更新一下
(wes) 17:29:28 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
pip3 install --user --upgrade cutadapt
Looking in indexes: http://pypi.douban.com/simple
更新失败了,算了算了,我的pip总感觉这个环境的有点问题
term_galore运行的参数
2.去接头
这是教程的代码:
cd 1.raw_fq
touch ../3.clean/trimgalore.log
## trim_galore.sh
cat ../tmp | while read id; do
fq1=${id}_1.fastq.gz
fq2=${id}_2.fastq.gz
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../3.clean $fq1 $fq2 >> trimgalore.log 2>&1
done
nohup bash trim_galore.sh &
multiqc ./ -n trim_galore_report -p -i " Trim REPORT OF SRP070662" -o multiqc
nohup fastqc --outdir ../2.qc/post --threads 16 *.fq.gz > ../2.qc/post/fastqc.log 2>&1
multiqc ./ -n post_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc
这是教程的,自己改脚本
cat trim_galore.sh
cd /data1/jiarongf/learning/cancer-WES/0.sra/raw_data
## trim_golore.sh
cat ../../data/case | while read id; do
fq1=${id}_1.fastq.gz
fq2=${id}_2.fastq.gz
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean $fq1 $fq2 > ../../log/trimgalore.log 2>&1
done
这里报错了,试一下跑一个,发现试python的问题,之前创建的wes环境是2的版本,所以跑完之后会报错
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
ERROR: Running in parallel is not supported on Python 2
Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
现在更改已经创建的环境的python版本
conda install python==版本号
但是有一个奇怪的就是再这个环境里面的conda show config是python是3.8.8的,为什么默认的python的版本就是2的,哎可能是创建环境的问题吧
网有点不好,慢慢等待
Proceed ([y]/n)? y
Downloading and Extracting Packages
cutadapt-4.0 | 208 KB | ##################################### | 100%
pbzip2-1.1.13 | 114 KB | ##################################### | 100%
certifi-2021.10.8 | 145 KB | ##################################### | 100%
dnaio-0.8.1 | 76 KB | ##################################### | 100%
python-isal-0.11.1 | 141 KB | ##################################### | 100%
setuptools-62.1.0 | 1.2 MB | ##################################### | 100%
xopen-1.5.0 | 25 KB | ##################################### | 100%
isa-l-2.30.0 | 192 KB | ##################################### | 100%
python-3.8.8 | 26.1 MB | ##################################### | 100%
pip-22.0.4 | 1.5 MB | ##################################### | 100%
python_abi-3.8 | 4 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
下载成功
并且也更新了cutadapt
(wes) 14:46:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
cutadapt -v
This is cutadapt 4.0 with Python 3.8.8
Command line parameters: -v
Run "cutadapt --help" to see command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.
cutadapt: error: unrecognized arguments: -v
(wes) 14:46:11 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
python --version
Python 3.8.8
再试着运行一下
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean case3_techrep_2_WES_1.fastq.gz case3_techrep_2_WES_2.fastq.gz
Number of cores used for trimming: 8
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
Output file(s) will be GZIP compressed
Cutadapt seems to be fairly up-to-date (version 4.0). Setting -j 8
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
10000000 sequences processed
^C
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
全部拿去跑试试
这里提醒,每次拿去跑的时候都要删掉之前的,不然不知道新生成的是哪个
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
nohup sh trim_galore.sh > trim_galore.sh.log 2>&1 &
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
top
好了再后台跑起来了
去看看log有没有问题,这里我生成了两个log,一个是运行sh的log一个是运行trim_galore的log
先去检查sh的log
嗯呢果然很简洁,没啥
在检查另一个log
成功的标志啊,哈哈哈哈哈,下面有再检查每一个adapter了
也要跑很久,趁着跑的这段时间,去学习一下term_galore 发现这里不同的测序平台会有不同的adaper,那么如何知道这批数据是什么平台呢,生成fq文件之后不是做了一次fastqc质控嘛,随意打开一个
就可以看到这个样本的encoding了,
学习trim_galore
教程:使用trim_galore对数据进行质量控制-过滤 - 简书 (jianshu.com)
参数说明
--quality/-q<INT>:设定Phred quality score阈值,默认为Phred 20 切除质量得分低于设定值的序列
--phred33/64:使用ASCII+33/64质量得分作为Phred得分选择-phred33或者-phred64,表示-测序平台使用的Phred quality score。
(需要确认:anger/Illumina 1.9+ encoding为33; Illumina 1.5 encoding为64)
--adapter/-a :输入adapter序列。也可以不输入,Trim Galore!会自动寻找可能性最高的平台对应的adapter。自动搜选的平台三个,也直接显式输入这三种平台,即--illumina、--nextera和--small_rna。
-s/--stringency<INT>:设定可以忍受的前后adapter重叠的碱基数,默认为1(非常苛刻)。可以适度放宽,因为后一个adapter几乎不可能被测序仪读到。
--length<INT>:设定输出reads长度阈值小于设定值会被抛弃,默认值20bp;即小于20bp的被去除。注意,在pe150下,不要设置太高,可以50或36(默认20)
--max_length : 设置长度大于此值被丢弃
-e <ERROR RATE> :默认0.1
--paired:对于双端测序结果,一对reads中,如果有一个被剔除,那么另一个会被同样抛弃,而不管是否达到标准。
--retain_unpaired:对于双端测序结果,一对reads中,如果一个read达到标准,但是对应的另一个要被抛弃,达到标准的read会被单独保存为一个文件。
--gzip和--dont_gzip:清洗后的数据zip打包或者不打包。
-o/--output_dir:输入目录 [需要提前建立目录,否则运行会报错]。
-- trim-n : 移除read一端的reads
-j :使用线程数, 注意假设已使用Python 3并安装了Pigz,那么内核设置为4,实际使用内核是15,因此,最高设为4.
--fastqc #当分析结束后,使用默认选项对结果文件进行fastqc分析
fasqc的结果
Per base sequence quality : 每个read各位置碱基的测序质量。
横轴碱基的位置,纵轴是质量分数,
Quality score= -10log10p
(p代表错误率),所以当质量分数为40的时候,p就是0.0001,质量算高了。
[红色线]代表中位数,
[蓝色线]代表平均数
[黄色]是25%-75%区间,
[权]是10%-90%区间。
若任一位置的下 [四分位数] 低于10或者 [中位数] 低于25---->出现 “警告”
若任一位置的下 [四分位数] 低于5或者 [中位数] 低于20,出现“失败,Fail”
跑完啦
昨天跑了一下去接头和质控,质控的脚本:
(wes) 08:38:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_fastqc.sh
nohup fastqc --outdir ../2.qc/post --threads 16 ../3.clean/*.fq.gz > ../log/post.fastqc.log 2>&1 &
multi一下
(wes) 08:41:12 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/post -n post_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/post/
(wes) 08:41:21 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh post_qc_multiqc.sh
/// MultiQC 🔍 | v1.13.dev0
| multiqc | Report title: QC REPORT OF SRP070662
| multiqc | Search path : /data1/jiarongf/learning/cancer-WES/2.qc/post
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 120/120
| fastqc | Found 60 reports
| multiqc | Compressing plot data
| multiqc | Report : ../2.qc/post/post_qc_report.html
| multiqc | Data : ../2.qc/post/post_qc_report_data
| multiqc | Plots : ../2.qc/post/post_qc_report_plots
| multiqc | MultiQC complete
(wes) 08:42:07 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
结果分析:
有看到dup的比例确实有减少
M seqs也有相应的降低
看到质量有明显的提升