Step four:
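Preprocess the raw data sent back from the company. Why is this step needed? Put simply, it is like cooking: greens pulled straight out of the field cannot be used as they are, and any proper routine at least washes and chops them. Washing needs a basin and chopping needs a knife, so handling raw data needs tools too. First you need a table to put the vegetables on. We are storing data, and big data at that, so an ordinary PC will definitely not do; ideally you want a server, and one with expanded memory. Later I will write up my own understanding of how to spec a machine that can handle bioinformatics work. Next comes the knife. There are many kinds of knives, and the tools (software) here are just as varied: fastqc, fastp, bwa, samtools, vcftools, gatk, picard, plink, and so on. These are only the small fraction I know of; reportedly there are thousands of such tools by now, but the steps stay the same and the tools' functions are broadly similar. As I see it, tools with the same function differ mainly in emphasis, which takes careful study to appreciate. The mainstream ones are pretty much those I listed, which is plenty for a beginner like me, and I have no wish to study them more closely.
Assess the quality of the data, that is, "inspect the vegetables". Educated people call this "QC" (quality control); as for why it has that name, I suspect it exists just so that beginners like us cannot understand it at first sight.
Basic information about the sequencing data, i.e. basic statistics
First, type fastqc -h to check whether it is installed!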
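To make "basic statistics" a bit more concrete, here is a minimal sketch of the kind of numbers involved (read count, read length range, GC content), computed by hand with awk. It assumes an uncompressed FASTQ file with exactly four lines per record; fastq_basic_stats is a hypothetical helper name of my own, not part of FastQC, and it is no substitute for the real report.

```shell
#!/bin/sh
# Hypothetical helper (not part of FastQC): summarise a FASTQ file.
# Assumes an uncompressed file with exactly 4 lines per record.
fastq_basic_stats() {
    awk 'NR % 4 == 2 {                     # line 2 of each record is the sequence
             n++
             len = length($0)
             total += len
             if (min == "" || len < min) min = len
             if (len > max) max = len
             gc += gsub(/[GCgc]/, "")      # gsub returns how many G/C bases it matched
         }
         END {
             if (n == 0) { print "reads=0"; exit }
             printf "reads=%d min_len=%d max_len=%d gc_percent=%.1f\n", n, min, max, 100 * gc / total
         }' "$1"
}

# Usage: fastq_basic_stats sample.fq
```

For a gzipped file, decompress it to a temporary file first (for example with gzip -dc) and point the function at that.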
FastQC - A high throughput sequence QC analysis tool
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano Files come from nanopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
--min_length Sets an artificial lower limit on the length of the sequence
to be shown in the report. As long as you set this to a value
greater or equal to your longest read length then this will be
the sequence length used to create your read groups. This can
be useful for making directly comaparable statistics from
datasets with somewhat variable read lengths.
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
Any bugs in fastqc should be reported either to
fastqc -o ./ ../reads/example1.*
-c: specifies a contaminant file; fastqc searches the overrepresented sequences against this file.
Adding -q switches to quiet mode (not really necessary).
As I see it, knowing -o and -f is enough to get going.
[hai@localhost ~]$ cd ~/proj1/fastqc/
[hai@localhost fastqc]$ fastqc -f fastq -o ./ ../reads/example1.*
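One thing from the help text is easy to trip over: the directory given to -o must already exist, because fastqc will not create it. A small wrapper along the following lines takes care of that. This is only a sketch; run_fastqc and the DRY_RUN switch are hypothetical names of my own, not FastQC features.

```shell
#!/bin/sh
# Hypothetical wrapper (run_fastqc and DRY_RUN are illustrative names, not part of FastQC).
# usage: run_fastqc OUTDIR FASTQ_FILE...
run_fastqc() {
    outdir=$1
    shift
    # fastqc does not create the -o directory itself, so make it first.
    mkdir -p "$outdir"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command, so it can be checked before anything runs.
        echo "fastqc -f fastq -o $outdir $*"
    else
        fastqc -f fastq -o "$outdir" "$@"
    fi
}

# Usage: run_fastqc ./qc ../reads/example1.*
```

Setting DRY_RUN=1 beforehand prints the command instead of running it, which is handy for checking that the wildcard expands to the files you expect.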
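fastp: fastp was developed as an ultra-fast FASTQ preprocessor with useful quality control and data filtering features. With a single scan of the FASTQ data it can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations. The tool is written in C++ and supports multi-threading. According to its developers' evaluation, fastp is 2 to 5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt, even though it performs far more operations than those tools.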
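As a rough picture of what per-read quality pruning can look like, here is a simplified sliding-window trim in awk. It scans each read from the front, and at the first window whose mean quality falls below the threshold it drops that window and everything after it. This is my own simplification for illustration, not fastp's actual implementation; sliding_trim is a hypothetical name, and Phred+33 quality encoding and four-line FASTQ records are assumed.

```shell
#!/bin/sh
# Hypothetical helper (not fastp's code): simplified sliding-window trimming.
# usage: sliding_trim FASTQ WINDOW_SIZE MIN_MEAN_QUALITY
# Assumes Phred+33 qualities and exactly 4 lines per record.
sliding_trim() {
    awk -v w="$2" -v q="$3" '
        # Lookup table: quality character -> Phred score.
        BEGIN { for (i = 33; i < 127; i++) phred[sprintf("%c", i)] = i - 33 }
        NR % 4 == 1 { head = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            qual = $0
            keep = length(qual)            # default: keep the whole read
            for (s = 1; s + w - 1 <= length(qual); s++) {
                sum = 0
                for (k = s; k < s + w; k++) sum += phred[substr(qual, k, 1)]
                if (sum / w < q) { keep = s - 1; break }   # cut from the failing window onward
            }
            # Reads trimmed down to nothing are dropped.
            if (keep > 0) print head "\n" substr(seq, 1, keep) "\n+\n" substr(qual, 1, keep)
        }' "$1"
}

# Usage: sliding_trim sample.fq 4 20 > trimmed.fq
```

Reads shorter than the window pass through untouched in this sketch; a real tool would pair this with a minimum-length filter.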
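Runs comprehensive quality control on the data automatically and generates a reader-friendly report.
Filtering (low quality, too short, too many N, ...).
For the head or tail of each read, computes the mean quality within a sliding window and cuts off subsequences whose mean is low (similar to what Trimmomatic does, but much faster).
Global trimming (at the head/tail, without affecting deduplication); for Illumina data the last one or two cycles often need this treatment.
Adapter removal. The impressive part is that you do not have to supply adapter sequences, because the algorithm detects them automatically and trims them.
For paired-end (PE) data, the software automatically finds the overlapping region of each read pair and corrects mismatched bases within that overlap.
Removes polyG at the tail. For Illumina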
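-i, --in1 input file for R1;
-I, --in2 input file for R2;
-o, --out1 output file for the processed R1;
-O, --out2 output file for the processed R2;
-h, --html file name of the HTML QC report; defaults to fastp.html if not set
-j, --json file name of the JSON QC report; defaults to fastp.json if not set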
fastp -i in.fq -o out.fq
fastp -i in.R1.fq -o out.R1.fq -I in.R2.fq -O out.R2.fq
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
fastqc -f fastq -o ./ ../data/reads/example.*
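With more than a couple of samples, typing the paired-end fastp command by hand gets tedious. The sketch below builds the command for one sample from its R1 file, using the -i/-I/-o/-O/-h/-j options listed above. It assumes the files are named like sample.R1.fq.gz and sample.R2.fq.gz, as in the example; fastp_pair_cmd and the ".clean" output naming are hypothetical choices of mine.

```shell
#!/bin/sh
# Hypothetical helper: print the fastp command for one paired-end sample.
# usage: fastp_pair_cmd PATH/SAMPLE.R1.fq.gz OUTDIR
# Assumes the mate file sits next to R1 and is named SAMPLE.R2.fq.gz.
fastp_pair_cmd() {
    r1=$1
    outdir=$2
    base=$(basename "$r1")
    sample=${base%.R1.fq.gz}               # strip the suffix to get the sample name
    r2="$(dirname "$r1")/$sample.R2.fq.gz"
    # Give each sample its own report names so fastp.html / fastp.json are not overwritten.
    echo "fastp -i $r1 -I $r2 -o $outdir/$sample.clean.R1.fq.gz -O $outdir/$sample.clean.R2.fq.gz -h $outdir/$sample.fastp.html -j $outdir/$sample.fastp.json"
}

# Usage: fastp_pair_cmd reads/s1.R1.fq.gz clean
```

The function only prints the command. Once it looks right, pipe the output to sh (after creating the output directory) to actually run it.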