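Once you have whole-genome NGS reads off the sequencer, the usual routine is FastQC + Cutadapt + Trimmomatic: strip the primer and adapter sequences and put the raw data through a round of heavy-handed filtering. The catch is that this reads and writes the data several times, which makes for a slow pipeline. So here is a recommendation: fastp, a smarter tool that rolls the functions of all three into one.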
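fastp can automatically detect primer and adapter sequences in FASTQ data, handles both single-end and paired-end data, and supports long as well as short reads. Adapters from the common sequencing platforms are recognized automatically, so there is no need to specify them by hand. It can also detect sequencing errors and correct or remove the affected bases. According to the fastp paper, it runs 2 to 5 times faster than preprocessing tools such as Trimmomatic or Cutadapt.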
Main features
Quoting the original:
- filter out bad reads (too low quality, too short, or too many N...)
- cut low quality bases for per read in its 5' and 3' by evaluating the mean quality from a sliding window (like Trimmomatic but faster).
- trim all reads in front and tail
- cut adapters. Adapter sequences can be automatically detected, which means you don't have to input the adapter sequences to trim them.
- correct mismatched base pairs in overlapped regions of paired end reads, if one base is with high quality while the other is with ultra low quality
- trim polyG in 3' ends, which is commonly seen in NovaSeq/NextSeq data. Trim polyX in 3' ends to remove unwanted polyX tailing (i.e. polyA tailing for mRNA-Seq data)
- preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
- report JSON format result for further interpreting.
- visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
- split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing. Two modes can be used, limiting the total split file number, or limiting the lines of each split file.
- support long reads (data from PacBio / Nanopore devices).
Installation
You can build it from source via git, or install it with conda.
# git
git clone https://github.com/OpenGene/fastp.git
cd fastp
make
sudo make install
# bioconda
conda install -c bioconda -y fastp
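After either route, a quick check that the binary is actually on the PATH can save some confusion later. The snippet below is only a sketch: `have_tool` is a helper name made up for this post, and it assumes that `fastp --version` prints the installed version.

```shell
#!/bin/sh
# have_tool is a hypothetical helper: it succeeds if the command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool fastp; then
    # Assumption: fastp reports its version with --version.
    fastp --version
else
    echo "fastp not found on PATH; install it with one of the methods above" >&2
fi
```

If the message on the else branch shows up right after a conda install, the environment that holds fastp is probably not activated.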
Running fastp
Quality filtering, length filtering, and adapter trimming are enabled by default. A low complexity filter is also available, but it has to be switched on explicitly (-y).
fastp -i single.fq -o cleaned.fq.gz -w 3 -q 15 -n 10
-i read1 input file name (string)
-o read1 output file name (string [=])
-w worker thread number, default is 3 (int [=3])
-q the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
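To make the -q, -u and -n thresholds more concrete, here is a small sketch that restates the rule from the option descriptions above in awk and applies it to a made-up two-read FASTQ file. It is a simplified illustration, not fastp's actual code: the function name `classify_reads` and the demo reads are invented for this post, and Phred+33 quality encoding is assumed.

```shell
#!/bin/sh
# Tiny made-up FASTQ with two 10-base reads (Phred+33 encoding assumed).
# '#' encodes Q2 (below Q15) and 'I' encodes Q40.
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@read_good
ACGTACGTAC
+
IIIIIIII##
@read_bad
ACGTNNGTAC
+
#####IIIII
EOF

# classify_reads FILE Q U N  (hypothetical helper)
# Per read: count bases with quality below Q and the number of N bases,
# then discard the read if either count exceeds its threshold.
classify_reads() {
    awk -v q="$2" -v u="$3" -v n="$4" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { name = substr($0, 2) }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            len = length($0)
            if (len == 0) next
            low = 0
            for (i = 1; i <= len; i++)
                if (ord[substr($0, i, 1)] - 33 < q) low++
            ncount = gsub(/N/, "N", seq)   # gsub returns the number of matches
            pct = 100 * low / len
            verdict = (pct > u || ncount > n) ? "discard" : "keep"
            printf "%s\t%d%%\t%d\t%s\n", name, pct, ncount, verdict
        }
    ' "$1"
}

classify_reads "$demo_fq" 15 40 10
```

With -q 15 -u 40 -n 10, read_good has 20% of its bases below Q15 and is kept, while read_bad has 50% below Q15 and is discarded. Its two N bases alone would not have been enough to drop it, since the limit here is 10.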
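Paired-end data works too. The command below also writes the report in both HTML and JSON format, trims one base from the tail of each read, discards reads shorter than 20 bp, and runs on 16 CPU threads.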
fastp -i pair1.fq -I pair2.fq -3 \
-o out_pair1.fq.gz -O out_pair2.fq.gz \
-h report.html -j report.json -q 15 -n 10 -t 1 -T 1 -l 20 -w 16
- -i read1 input file name (string)
- -I read2 input file name (string [=])
- -o read1 output file name (string [=])
- -O read2 output file name (string [=])
- -3 enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)
- -w worker thread number, default is 3 (int [=3])
- -n if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
- -t trimming how many bases in tail for read1, default is 0 (int [=0])
- -T trimming how many bases in tail for read2, default is 0 (int [=0])
- -A adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
- -l reads shorter than length_required will be discarded, default is 15. (int [=15])
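With more than a handful of samples, the paired-end command is usually wrapped in a loop. The following is a minimal sketch under assumptions of my own: the helper `fastp_cmd`, the `raw/` and `clean/` directories, and the `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming scheme are all hypothetical. It only prints the commands (a dry run), so the file pairing can be checked before anything is executed.

```shell
#!/bin/sh
# fastp_cmd is a hypothetical helper that only PRINTS the fastp command
# for one sample, given the path to its R1 file.
# Assumed layout: <dir>/<sample>_R1.fq.gz next to <dir>/<sample>_R2.fq.gz.
fastp_cmd() {
    r1=$1
    sample=$(basename "$r1" _R1.fq.gz)
    r2="$(dirname "$r1")/${sample}_R2.fq.gz"
    echo "fastp -i $r1 -I $r2" \
         "-o clean/${sample}_R1.fq.gz -O clean/${sample}_R2.fq.gz" \
         "-h clean/${sample}.html -j clean/${sample}.json" \
         "-q 15 -n 10 -l 20 -w 16"
}

# Dry run over every R1 file in raw/ (hypothetical directory).
for r1 in raw/*_R1.fq.gz; do
    [ -e "$r1" ] || continue   # skip the literal pattern when nothing matches
    fastp_cmd "$r1"
done
```

Once the printed commands look right, create the output directory with `mkdir -p clean` and pipe the loop's output into `sh` to run them. Giving each sample its own -h and -j file keeps the reports from overwriting one another. The sketch does not handle paths containing spaces.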
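Once the run finishes, you can open the resulting report.
Sequence quality distribution
Base content
k-mer overrepresentation analysis
It really is that simple. Mom never has to worry about my FASTQ preprocessing again.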
References
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. bioRxiv preprint, first posted online Mar. 1, 2018. doi: http://dx.doi.org/10.1101/274100. PDF: https://www.biorxiv.org/content/biorxiv/early/2018/03/01/274100.full.pdf