Commands used to clean the PAS data
cd /data5/Cleanreads/seq_20181119up_1210down_Xten/result/raw && zcat 1_arabidopsis-WT_3-PAS.fq.gz | /public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a AGATCGGAAGAGC -m 17 --label raw -O 3 -e 0.1 -|fastx_trimmer -Q 33 -f 4 |fastq_quality_trimmer -Q 33 -t 20 -l 16 | fastq_quality_filter -Q 33 -q 20 -p 70 | /public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a N -m 16 -O 1 -N - | /public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a GGGGGGGGGG -O 1 -e 0.3 -m 16 -n 5 - > 1_arabidopsis-WT_3-PAS.fq.clean_fq && fastqc 1_arabidopsis-WT_3-PAS.fq.clean_fq && rm -rf 1_arabidopsis-WT_3-PAS.fq.clean_fq_fastqc.zip &&
cd /data5/Cleanreads/seq_20181119up_1210down_Xten/result/raw && zcat 2_arabidopsis-WT_3-PAS.fq.gz | /public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a AGATCGGAAGAGC -m 20 --label raw -O 3 -e 0.1 -| fastx_trimmer -Q 33 -t 3|fastq_quality_trimmer -Q 33 -t 20 -l 16|fastq_quality_filter -Q 33 -q 20 -p 70 |/public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a N -m 16 -O 1 -N - | /public/software/exec/ActivePython-2.7.8.10/bin/cutadapt -a GGGGGGGGGG -O 1 -e 0.3 -m 16 -n 5 - > 2_arabidopsis-WT_3-PAS.fq.clean_fq && fastqc 2_arabidopsis-WT_3-PAS.fq.clean_fq && rm -rf 2_arabidopsis-WT_3-PAS.fq.clean_fq_fastqc.zip &&
Software involved:
cutadapt
Parameter explanations:
-a --adapter=ADAPTER
Sequence of an adapter that was ligated to the 3' end. The adapter itself and anything that follows is trimmed. If the adapter sequence ends with the '$' character, the adapter is anchored to the end of the read and only found if it is a suffix of the read.
-m LENGTH, --minimum-length=LENGTH
Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted (default: 0).
-O LENGTH, --overlap=LENGTH
Minimum overlap length. If the overlap between the read and the adapter is shorter than LENGTH, the read is not modified. This reduces the no. of bases trimmed purely due to short random adapter matches (default: 3).
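-e Maximum error rate (fraction of mismatches) allowed in an adapter match. For example, if cutadapt matches 15 bp of adapter in a read, then 15*0.1 = 1.5, rounded down to 1, so one mismatched base is allowed within those 15 bp.
-m --minimum-length Minimum length a read must have after adapter removal; shorter reads are discarded.
-O --overlap By default at least 3 bases must match before the sequence is treated as adapter; in some cases it is worth raising this value.
--discard-trimmed Discard reads in which an adapter was found (by default cutadapt only trims off the adapter and everything after it).
--untrimmed-output Write reads in which no adapter was found to the given file (must be used together with -o).
--untrimmed-paired-output Write the paired reads in which no adapter was found to the given file (must be used together with -p).
--pair-filter=(any|both) Very useful for paired-end data, where an adapter may be detected in read1, in read2, or in both. With any, the pair (read1 and read2) is discarded as soon as either read has a detected adapter; with both, the pair is discarded only if both reads have a detected adapter.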
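The -e arithmetic in the notes above can be checked with a short Python sketch. It assumes that the number of tolerated errors is the matched adapter length multiplied by the error rate, rounded down, as the cutadapt documentation describes. The function name allowed_errors is hypothetical and is not part of cutadapt.

```python
def allowed_errors(match_length, error_rate):
    """Errors tolerated in an adapter match of match_length bases.

    Assumption: allowed errors = floor(match_length * error_rate).
    This is an illustration, not cutadapt's own code.
    """
    return int(match_length * error_rate)


# Values taken from the notes and the commands above.
print(allowed_errors(15, 0.1))                    # worked example in the -e note
print(allowed_errors(len("AGATCGGAAGAGC"), 0.1))  # first cutadapt pass, -e 0.1
print(allowed_errors(len("GGGGGGGGGG"), 0.3))     # poly-G pass, -e 0.3
print(allowed_errors(3, 0.1))                     # shortest overlap with -O 3
```

Under this assumption a full-length match of the 13 bp adapter tolerates one mismatch, while a 3-base overlap at the read end must match exactly.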
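fastx_trimmer [-h] [-f N] [-l N] [-t N] [-m MINLEN] [-z] [-v] [-i INFILE] [-o OUTFILE] selects which part of each read, counted from the 5' end, is kept
[-f N] = First base to keep; the default is 1 (the first base).
[-l N] = Last base to keep; the default is the entire read.
[-t N] = Trim N bases from the end (3' side) of the read.
[-m MINLEN] = When used with -t, discard reads shorter than MINLEN after trimming.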
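The position-based options of fastx_trimmer can be pictured with a small Python sketch on a plain sequence string. The helper names keep_range and trim_tail and the example read are hypothetical; the real tool applies the same cut to the quality string as well.

```python
def keep_range(seq, first=1, last=None):
    """Mimic -f/-l: keep bases first..last (1-based, inclusive)."""
    end = len(seq) if last is None else last
    return seq[first - 1:end]


def trim_tail(seq, n):
    """Mimic -t: remove n bases from the end of the read."""
    return seq[:max(len(seq) - n, 0)]


read = "ACGTACGTAC"  # hypothetical 10-base read
print(keep_range(read, first=4))  # like -f 4: drops the first 3 bases
print(trim_tail(read, 3))         # like -t 3: drops the last 3 bases
```

In the commands above, the first sample uses -f 4 (removing three bases at the 5' end), whereas the second uses -t 3 (removing three bases at the 3' end).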
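fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE] trims low-quality bases from the ends of reads
[-t N] = Quality threshold; bases with quality below N are trimmed, starting from the end (3' side) of the read.
[-l N] = Minimum length; reads shorter than N after trimming are discarded.
[-z] = Compress the output with gzip.
[-v] = Verbose: report the number of sequences. With -o, the report is printed to STDOUT; without -o (reads go to STDOUT), the report is printed to STDERR.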
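A minimal Python sketch of the trimming rule, assuming Phred+33 qualities (the -Q 33 setting in the commands above) and assuming trimming stops at the first base, scanning back from the 3' end, whose quality reaches the threshold. The names phred33 and quality_trim are hypothetical, and the real tool may differ in details.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def quality_trim(seq, qual, threshold, min_length=0):
    """Trim bases below threshold from the 3' end.

    Returns (seq, qual) after trimming, or None if the read is
    shorter than min_length and would be discarded.
    """
    scores = phred33(qual)
    end = len(scores)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], qual[:end]


# Hypothetical read: 'I' decodes to Q40 and '#' to Q2.
print(quality_trim("ACGTACGT", "IIIII###", threshold=20, min_length=4))
print(quality_trim("ACGTACGT", "IIIII###", threshold=20, min_length=16))
```

The second call mirrors -t 20 -l 16 from the pipeline: the trimmed read is only 5 bases long, so it is dropped.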
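fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE] filters out low-quality reads
[-q N] = Minimum quality score to keep.
[-p N] = Minimum percentage of bases in each read that must have a quality of at least -q.
[-z] = Compress the output with gzip.
[-v] = Verbose: report the number of sequences. With -o, the report is printed to STDOUT; without -o (reads go to STDOUT), the report is printed to STDERR.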
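The -q/-p rule can be sketched in Python as follows, again assuming Phred+33 qualities. The function name passes_quality_filter is hypothetical, and the real tool may round the percentage slightly differently.

```python
def passes_quality_filter(qual, min_quality, min_percent):
    """Return True if at least min_percent of bases have quality >= min_quality."""
    scores = [ord(ch) - 33 for ch in qual]  # Phred+33 decoding
    if not scores:
        return False
    good = sum(1 for q in scores if q >= min_quality)
    # Compare as integers to avoid floating-point rounding.
    return good * 100 >= min_percent * len(scores)


# Hypothetical reads: 'I' decodes to Q40 and '#' to Q2.
print(passes_quality_filter("IIIIIII###", min_quality=20, min_percent=70))
print(passes_quality_filter("IIIIII####", min_quality=20, min_percent=70))
```

With -q 20 -p 70, as in the pipeline, the first read (7 of 10 bases at Q20 or better) is kept and the second (6 of 10) is discarded.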