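The official documentation already provides a detailed tutorial.
Here I record the methods I use most often:
Installation: install with Python 3 so that the multi-core option can be used.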
python3 -m pip install --user --upgrade cutadapt
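What is a 3' adapter? The adapter follows the sequence of interest: XXXXXXXXXXXXXXadapter
What is a 5' adapter? The adapter sits at the start of the sequence: adapterXXXXXXXXXXXXXX
If my data are of the first kind, I use the -a option; the adapter and everything after it are trimmed.
If they are of the second kind, I use the -g option; the adapter and everything before it are trimmed.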
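The difference between the two cases can be sketched in a few lines of Python. This is only an illustration of the trimming direction, assuming an exact, full-length adapter match; cutadapt itself also handles mismatches and partial matches. The function names and the short adapter are made up for the example.

```python
def trim_3prime(read: str, adapter: str) -> str:
    """Mimic -a: drop the adapter and everything after it."""
    pos = read.find(adapter)
    # No adapter found: the read is returned unchanged.
    return read if pos == -1 else read[:pos]


def trim_5prime(read: str, adapter: str) -> str:
    """Mimic -g: drop the adapter and everything before it."""
    pos = read.find(adapter)
    return read if pos == -1 else read[pos + len(adapter):]


if __name__ == "__main__":
    adapter = "ACGTACGT"  # hypothetical adapter, for illustration only
    print(trim_3prime("TTTTGGGG" + adapter + "CC", adapter))
    print(trim_5prime("CC" + adapter + "TTTTGGGG", adapter))
```

In both calls only the TTTTGGGG part survives; what changes is which side of the adapter is thrown away.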
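The default error rate allowed in an adapter match is 10%; change it with the -e option. The output files in the examples below are not compressed (cutadapt picks the output compression from the file extension, so a name ending in .gz would give gzip output).
Examples: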
cutadapt -a adapter=ATATCCAGAACCCTGACCCTGCCGTGTACCAGCTGAC -O 10 -o G18E2L2_R1.p1.fq -r R1.p2.fq --info-file=R1.cutadapt.log /your/fastq/fastq_1.fq.gz > R1.cutadapt.stats
cutadapt -g adapter=CACAGCGACCTCGGGTGGGAACACCTTGTTCAGGTCT -O 10 -o G18E2L2_R2.p1.fq -r R2.p2.fq --info-file=R2.cutadapt.log /your/fastq/fastq_2.fq.gz > R2.cutadapt.stats
-O --overlap=MINLENGTH : Require MINLENGTH overlap between read and adapter for an adapter to be found. Default: 3
-o output.fastq
-r FILE, --rest-file=FILE When the adapter matches in the middle of a read, write the rest (after the adapter) to FILE.
--info-file=FILE Write information about each read and its adapter matches into FILE. See the documentation for the file format.
-j CORES, --cores=CORES Number of CPU cores to use. Use 0 to auto-detect. Default: 1. Multi-core mode cannot be used under Python 2.
-a ADAPTER, --adapter=ADAPTER Sequence of an adapter ligated to the 3' end (paired data: of the first read). The adapter and subsequent bases are trimmed. If a '$' character is appended ('anchoring'), the adapter is only found if it is a suffix of the read.
-g ADAPTER, --front=ADAPTER Sequence of an adapter ligated to the 5' end (paired data: of the first read). The adapter and any preceding bases are trimmed. Partial matches at the 5' end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read.
-b ADAPTER, --anywhere=ADAPTER Sequence of an adapter that may be ligated to the 5' or 3' end (paired data: of the first read). Both types of matches as described under -a and -g are allowed. If the first base of the read is part of the match, the behavior is as with -g, otherwise as with -a. This option is mostly for rescuing failed library preparations - do not use if you know which end your adapter was ligated to!
Fuzzy matching (error tolerance):
-e RATE, --error-rate=RATE Maximum allowed error rate as value between 0 and 1 (no. of errors divided by length of matching region). Default: 0.1 (=10%)
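To get a feel for what the rate means in practice, here is a small sketch that converts it into a number of allowed errors for a given match length. It assumes the product is rounded down, which is my reading of the cutadapt documentation; check the documentation for your version. The function name is made up.

```python
import math


def max_errors(match_length: int, error_rate: float = 0.1) -> int:
    """Errors tolerated for a match of the given length (assumed: rounded down)."""
    return math.floor(match_length * error_rate)


if __name__ == "__main__":
    # 37 is the length of the adapters used in the examples above;
    # 10 is the minimum overlap set there with -O 10.
    for length in (37, 10, 9):
        print(length, max_errors(length))
```

Under this assumption a full-length match of the 37-base adapter tolerates 3 errors, a 10-base overlap tolerates 1, and anything shorter than 10 bases must match exactly at the default rate.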
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
Options used in the examples above:
-O MINLENGTH, --overlap=MINLENGTH: require MINLENGTH overlap between read and adapter for an adapter to be found. Default: 3
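-r: writes the rest of the read, i.e. the part cut off together with the adapter, to a file (R2.p2.fq in the second example).
--info-file: writes the per-read log file.
The stats file holds cutadapt's summary report on adapter trimming. By default it is printed to the screen, so it is best to redirect it to a file, as I do, for later reference.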
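When many samples have to be processed, the same command line can be assembled and launched from Python, with standard output redirected to the stats file. The following is a minimal sketch that only uses the options shown in the examples above; the function names and the file names in the usage comment are hypothetical.

```python
import subprocess
from typing import List


def build_cutadapt_cmd(adapter: str, fastq_in: str, trimmed_out: str,
                       rest_out: str, info_out: str,
                       five_prime: bool = False,
                       min_overlap: int = 10) -> List[str]:
    """Assemble the argument list for a single-end cutadapt run."""
    flag = "-g" if five_prime else "-a"  # -g: 5' adapter, -a: 3' adapter
    return [
        "cutadapt", flag, f"adapter={adapter}",
        "-O", str(min_overlap),
        "-o", trimmed_out,
        "-r", rest_out,
        f"--info-file={info_out}",
        fastq_in,
    ]


def run_cutadapt(cmd: List[str], stats_path: str) -> None:
    """Run cutadapt and redirect its report (stdout) to stats_path."""
    with open(stats_path, "w") as stats:
        subprocess.run(cmd, stdout=stats, check=True)


# Usage (requires cutadapt on PATH; file names are placeholders):
# cmd = build_cutadapt_cmd("ACGT", "in.fq.gz", "s.p1.fq", "s.p2.fq", "s.cutadapt.log")
# run_cutadapt(cmd, "s.cutadapt.stats")
```

Passing the arguments as a list avoids shell quoting problems, and check=True makes a failed run raise an exception instead of silently leaving an incomplete stats file.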
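By default cutadapt trims the adapter together with everything after it (or, for a 5' adapter, everything before it). So if you only want to cut out the adapter itself and keep the sequence on both sides of it, you have to extract those sequences from the log file.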
Processing the cutadapt log file:
The log file has the following format.
It contains three types of record.
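As a rough illustration of how such a log can be split, here is a sketch that turns one line of the info file into a "before the adapter" and an "after the adapter" FASTQ record. It is not the script described in this post. The column layout is an assumption taken from my reading of the cutadapt documentation: tab-separated, with read name, error count, match start, match end, sequence left of the match, matched sequence, sequence right of the match, adapter name, then the three corresponding quality strings; reads without an adapter have -1 in the second column. Verify this against the documentation for your cutadapt version.

```python
from typing import Optional, Tuple

FastqRecord = Tuple[str, str, str]  # (read name, sequence, qualities)


def split_info_line(line: str) -> Optional[Tuple[FastqRecord, FastqRecord]]:
    """Return (before, after) records for a line with an adapter match.

    Returns None for reads without an adapter (second column is -1)
    or for lines that do not have the assumed 11 columns.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11 or fields[1] == "-1":
        return None
    name = fields[0]
    before = (name, fields[4], fields[8])   # sequence/qualities left of the adapter
    after = (name, fields[6], fields[10])   # sequence/qualities right of the adapter
    return before, after


def to_fastq(record: FastqRecord) -> str:
    """Format a record as a four-line FASTQ entry."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Writing the first record of each pair to one file and the second to another gives the two sides of the adapter as separate FASTQ files.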
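Handy script 1:
It writes the read segments before and after the adapter, as recorded in the cutadapt log, to separate files for later use.
In other words, the sequences on the two sides of the adapter go to the p1 and p2 files respectively.
Usage (I wrote the script myself, and it is very handy):
python deal_cutadapt_log.py -l xxx.cutadapt.log -d /result/dir/
This produces two files,
xxx.p1.fq and xxx.p2.fq,
which hold the sequence before the adapter and the sequence after the adapter.
The -f option additionally controls whether reads in the log that have no adapter are kept or dropped.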
usage: deal_cutadapt_log.py [-h] -l LOG_FILE [-d RESULT_DIR] [-f] [-v]
This is description
optional arguments:
  -h, --help            show this help message and exit
  -l LOG_FILE, --log LOG_FILE
                        input read1 file
  -d RESULT_DIR, --dir RESULT_DIR
                        input read2 file
  -f, --flag            means to contains -l flag in output.
  -v, --version         show program's version number and exit
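Handy script 2:
Batch summary of cutadapt.stats files: the input is a directory, and the script collects the relevant information from every stats file under that path.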
python statistic_basic_info.py ./
sample     Total reads processed    Reads with adapters
G34E3L1    10,934,616               10,455,685 (95.6%)
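A summary like the one above can be approximated with a short parser. The sketch below is not the statistic_basic_info.py script itself; it assumes that each stats file contains report lines starting with "Total reads processed:" and "Reads with adapters:", matching the column names in the table, and that stats files end in .stats. Both are assumptions, and the report wording can differ between cutadapt versions.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed report labels, taken from the column names in the table above.
PATTERNS = {
    "Total reads processed": re.compile(
        r"^Total reads processed:\s+([\d,]+)", re.MULTILINE),
    "Reads with adapters": re.compile(
        r"^Reads with adapters:\s+([\d,]+ \([\d.]+%\))", re.MULTILINE),
}


def parse_stats(text: str) -> Dict[str, str]:
    """Extract the two summary values from the text of one stats file."""
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[label] = match.group(1)
    return found


def summarize(directory: str) -> None:
    """Print one tab-separated row per *.stats file in the directory."""
    print("sample", *PATTERNS, sep="\t")
    for path in sorted(Path(directory).glob("*.stats")):
        values = parse_stats(path.read_text())
        sample = path.name.split(".")[0]  # text before the first dot
        print(sample, *(values.get(k, "NA") for k in PATTERNS), sep="\t")
```

Calling summarize("./") prints a header row followed by one row per stats file, with NA where a line was not found.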
Very handy.
Leave a like and I will send you the scripts!