以A01样本为例:
1_rawdata/A01_R1.fq.gz 1_rawdata/A01_R2.fq.gz
经过cutadapt后得到
tempR1 = 'A01_cut_R1.fq' tempR2 = 'A01_cut_R2.fq'
# fastp
# 输入文件: tempR1 = '1_rawdata/A01_cut_R1.fq' tempR2 = '1_rawdata/A01_cut_R2.fq'
# 输出文件: out = '2_cleandata/A01*'
fastp = """
/software/fastp/fastp-master/fastp -i {1} -I {2} \
-o {3}_clean_R1.fq -O {3}_clean_R2.fq \
-l 50 -g -W 5 \
-A -w 8 -5 \
-j {3}_fastp.json -h {3}_fastp.html -s 10
"""
tempR1 = 'A01_cut_R1.fq'
tempR2 = 'A01_cut_R2.fq'
out = '2_cleandata/A01'
comm2 = fastp.format(0, tempR1, tempR2, out)
# '/software/fastp/fastp-master/fastp \
# -i 1_rawdata/A01_cut_R1.fq \
# -I 1_rawdata/A01_cut_R2.fq \
# -o 2_cleandata/A01_clean_R1.fq \
# -O #2_cleandata/A01_clean_R2.fq \
# -l 50 -g -W 5 -A -w 8 -5 \
# -j 2_cleandata/A01_fastp.json -h 2_cleandata/A01_fastp.html \
# -s 10'
popen(comm2).read()
- fastp 执行结果
>>> popen(comm2).read()
Read1 before filtering:
total reads: 109205
total bases: 16271850
Q20 bases: 15778246(96.9665%)
Q30 bases: 15067666(92.5996%)
Read2 before filtering:
total reads: 109205
total bases: 16271850
Q20 bases: 15607406(95.9166%)
Q30 bases: 14743647(90.6083%)
Read1 after filtering:
total reads: 105506
total bases: 15715013
Q20 bases: 15353976(97.7026%)
Q30 bases: 14692223(93.4916%)
Read2 aftering filtering:
total reads: 105506
total bases: 15712636
Q20 bases: 15278227(97.2353%)
Q30 bases: 14483473(92.1772%)
Filtering result:
reads passed filter: 211012
reads failed due to low quality: 7120
reads failed due to too many N: 32
reads failed due to too short: 246
Duplication rate: 0.669288%
Insert size peak (evaluated by paired-end reads): 267
JSON report: 2_cleandata/A01_fastp.json
HTML report: 2_cleandata/A01_fastp.html
/software/fastp/fastp-master/fastp -i 1_rawdata/A01_cut_R1.fq -I 1_rawdata/A01_cut_R2.fq -o 2_cleandata/A01_clean_R1.fq -O 2_cleandata/A01_clean_R2.fq -l 50 -g -W 5 -A -w 8 -5 -j 2_cleandata/A01_fastp.json -h 2_cleandata/A01_fastp.html -s 10
fastp v0.20.0, time used: 2 seconds
- 结果文件
由于在命令中有一个参数-s/--split 10
,所以会得到split成10份后的处理结果。
[Mb18@login TESTA01]$ ll -t 2_cleandata/
total 76072
-rw-r--r-- 1 Mb18 mb 3888374 May 12 14:54 0001.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3887372 May 12 14:54 0001.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3898490 May 12 14:54 0002.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3898254 May 12 14:54 0002.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3882417 May 12 14:54 0003.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3881973 May 12 14:54 0003.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3898367 May 12 14:54 0004.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3897963 May 12 14:54 0004.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3891595 May 12 14:54 0005.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3891717 May 12 14:54 0005.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3896737 May 12 14:54 0006.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3896309 May 12 14:54 0006.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3900485 May 12 14:54 0007.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3899737 May 12 14:54 0007.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 3893412 May 12 14:54 0008.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 3892842 May 12 14:54 0008.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 1767806 May 12 14:54 0009.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 1767612 May 12 14:54 0009.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 5736744 May 12 14:54 0010.A01_clean_R1.fq
-rw-r--r-- 1 Mb18 mb 5735894 May 12 14:54 0010.A01_clean_R2.fq
-rw-r--r-- 1 Mb18 mb 474717 May 12 14:54 A01_fastp.html
-rw-r--r-- 1 Mb18 mb 125054 May 12 14:54 A01_fastp.json
- A01_fastp.json
{
"summary": {
"before_filtering": {
"total_reads":218410,
"total_bases":32543700,
"q20_bases":31385652,
"q30_bases":29811313,
"q20_rate":0.964416,
"q30_rate":0.916039,
"read1_mean_length":149,
"read2_mean_length":149,
"gc_content":0.499539
},
"after_filtering": {
"total_reads":211012,
"total_bases":31427649,
"q20_bases":30632203,
"q30_bases":29175696,
"q20_rate":0.97469,
"q30_rate":0.928345,
"read1_mean_length":148,
"read2_mean_length":148,
"gc_content":0.496156
}
},
"filtering_result": {
...
},
"duplication": {
...
},
"insert_size": {
...
},
"read1_before_filtering": {
...
},
"read1_after_filtering": {
...
},
...
-
fastp.report.html
- 对结果进行处理
f = open("2_cleandata/A01_fastp.json").read()
ss = eval(f)
hq_r = ss["summary"]["after_filtering"]["total_reads"]
hq_b = ss["summary"]["after_filtering"]["total_bases"]
with open("2_cleandata/A01_summary2.xls", "w") as out2:
out2.write("SampleID\tLib\tHQ Reads(%)\tHQ Data(bp)\tHQ Data(%)\n")
out2.write("%s\tSPE\t%.2f\t%d\t%.2f\n" % ('A01', 1.0*hq_r/summary[0]*100, hq_b, 1.0*hq_b/summary[1]*100))
- 统计结果
统计过滤后的reads占比
[login TESTA01]$ less -S 2_cleandata/A01_summary2.xls
SampleID Lib HQ Reads(%) HQ Data(bp) HQ Data(%)
A01 SPE 96.57 31427649 95.89
- 最后,将拆开的10个结果合并,用于后续分析
cat 2_cleandata/[0-9][0-9][0-9][0-9].${sample}_clean_R1.fq > 2_cleandata/${sample}_clean_R1.fq &
cat 2_cleandata/[0-9][0-9][0-9][0-9].${sample}_clean_R2.fq > 2_cleandata/${sample}_clean_R2.fq &