Step four:
对从公司送回来的原始数据raw data预处理,至于问什么要进行这一步,通俗点讲就像是炒菜,直接从地里拔出来的青菜是不能直接用的,而正规的程序,至少要洗洗切切;而洗菜得用盆,切菜得用刀,所以要处理这些raw data也得要找点工具:首先得有个切菜的桌子放菜,而我们要放数据而且是大数据,一般的电脑是肯定不行的,最好是台服务器,而且要台扩大内存的服务器,后期我会讲一讲自己理解的该怎那么配一台能做生信的电脑;其次,就是菜刀了,刀的种类有很多,这里的工具(软件)也是多种多样:fastqc、fastp、bwa、samtools、vcftools、gatk、picard、plink~~~~~这里的是我了解的一小部分,据统计这样的工具现在有数千种,但是步骤就这些,它们的功能也都是大同小异,同一功能软件的差异我以为是侧重点有所不同,这个需要细品,但是大众化的也就我列举的这几个,反正对于我这样的小白白是够了,我也不想去细品了。
我认为那到菜的第一件事就是看看这菜新不新鲜、能不能吃,数据也是这样,看看这个数据合不合格、能不能用。下面就要用到我们的第一把刀了——fastqc,用它看一下“这菜”怎么样?
哎幺幺!忘了和大家说一个前提,做生信的大多数软件最好用Linux系统,其实这个系统也很简单,就和你用惯了Android,现在得用iOS了,多少你的适应一下。其实啊,这里用到的命令也就几条,想了解,后面我也和你说一下。这其实都不难,如果非要说点难点,我觉着就是得了解Linux下每个软件参数怎么用。
噢噢噢噢噢!
没别的意思就是想提提提神,下面就让我们一起了解一下fastqc。
Fastqc是一款基于JAVA的软件,它可以很快的对raw
data 进行数据质量评估,也就是“看菜”,读书人叫他“QC”(quality control),至于问什么这么叫我觉着它就是想让像我们这样的小白白第一次看的时候看不懂。
用fastqc可以了解以下信息
测序数据的基本信息:也就是basic statistics
每个碱基的质量值:只要大多数数据》20就差不多
每条reads序列的质量值:横轴是quality,纵轴是reads的数目,可以大体上了解那一部分质量比较差
每条序列的ATCG组成:A=T,G=C,满足这个碱基互补配对原则
每条序列N的含量:一般不出现什么东西,有东西出现就是有问题了
GC含量分布:一般是两条线,一条蓝的事理论值,一条红的不均匀的线,一般跟小狗啃的样,崎岖不平,但是模样应该和理论值差不多,如果偏差太大就是样本应该是被污染了
每条序列的长度分布:根据图可以大体了解哪一块长度的reads比较多
序列中duplication程度:重复序列是正常的,在提dna时,样品本来就是有重复的序列,因为,我们不是提的单细胞基因,还有在测序是要进行桥式pcr扩增,用以各碱基扩大信号。但是如果duplication过大就说明了有bias的存在。
其实,还有许多信息,但是,我们这里了解这些就够了。而且如果有问题fastqc会用叉号提示你的。
知道了fastqc可以做什么,能得到什么,我们下面就了解一下怎么用它吧!
首先下载它,安装它,然后用它(这部分我补上的)。在我看来Linux的软件的玩法都差不多,知道它的大体用法,了解一下它的参数基本就可以操作了。好!盘它!
首先敲一下fastqc –h看装好了吗!
FastQC - A high throughputsequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-ffastq|bam|sam]
[-c contaminant file] seqfile1 ..seqfileN
DESCRIPTION
FastQC reads a set of sequence files andproduces from each one a quality
control report consisting of a number ofdifferent modules, each one of
which will help to identify a differentpotential type of problem in your
data.
If no files to process are specified on thecommand line then the program
will start as an interactive graphicalapplication. If files are provided
on the command line then the program willrun with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specifiedoutput directory.
Please note that thisdirectory must exist as the program
will not create it. If this option is not set then the
output file for eachsequence file is created in the same
directory as the sequencefile which was processed.
--casava Files come from raw casava output.Files in the same sample
group (differing only bythe group number) will be analysed
as a set rather thanindividually. Sequences with the filter
flag set in the header willbe excluded from the analysis.
Files must have the same names given tothem by casava
(including being gzippedand ending with .gz) otherwise they
won't be grouped togethercorrectly.
--nano Files come from nanopore sequencesand are in fast5 format. In
this mode you can pass indirectories to process and the program
will take in all fast5files within those directories and produce
a single output file fromthe sequences found in all files.
--nofilter If running with --casava then don'tremove read flagged by
casava as poor quality whenperforming the QC analysis.
--extract If set then the zipped output file willbe uncompressed in
the same directory after ithas been created. By default
this option will be set iffastqc is run in non-interactive
mode.
-j --java Provides the full path to the javabinary you want to use to
launch fastqc. If notsupplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file aftercreating it. You
should set this option ifyou do not wish to uncompress
the output when running innon-interactive mode.
--nogroup Disable grouping of bases for reads>50bp. All reports will
show data for every base inthe read. WARNING: Using this
option will cause fastqc tocrash and burn if you use it on
really long reads, and your plots may end upa ridiculous size.
You have been warned!
--min_length Sets an artificial lower limit on thelength of the sequence
to be shown in the report. As long as you set this to a value
greater or equal to yourlongest read length then this will be
the sequence length used tocreate your read groups. This can
be useful for makingdirectly comaparable statistics from
datasets with somewhatvariable read lengths.
-f --format Bypasses the normal sequence file formatdetection and
forces the program to usethe specified format. Valid
formats arebam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can beprocessed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run morethreads than your
available memory will copewith, and not more than
6 threads on a 32 bitmachine
-c Specifies a non-default file whichcontains the list of
--contaminants contaminants to screen overrepresentedsequences against.
The file must contain setsof named contaminants in the
formname[tab]sequence. Lines prefixed with ahash will
be ignored.
-a Specifies a non-default filewhich contains the list of
--adapters adapter sequences which will be explicitysearched against
the library. The file mustcontain sets of named adapters
in the formname[tab]sequence. Lines prefixed with ahash
will be ignored.
-l Specifies a non-default filewhich contains a set of criteria
--limits which will be used to determine thewarn/error limits for the
various modules. This file can also be used to selectively
remove some modules fromthe output all together. The format
needs to mirror the defaultlimits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look forin the Kmer content
module. Specified Kmer lengthmust be between 2 and 10. Default
length is 7 if notspecified.
-q --quiet Supress all progress messages on stdoutand only report errors.
-d --dir Selects a directory to be used for temporaryfiles written when
generating report images.Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reportedeither to simon.andrews@babraham.ac.uk
or inwww.bioinformatics.babraham.ac.uk/bugzilla/
然后,你如果得到这些内容,它基本就是好了!如果other,就是有问题,自己查一下解决一下,不过也可以私信,一起探讨一下。
fastq-o ./ ../reads/example1.*
-o:用来指定输出文件的所在目录,注意是不能自动新建目录的。
-f:用来强制指定输入文件格式,默认会自动检测。
-c:用来指定一个contaminant文件,fastqc会把overrepresented sequences往这个contaminant文件里搜索。
contaminant文件的格式是"Name\tSequences",#开头的行是注释。
加上 -q 会进入沉默模式,(这个就没什么必要)
在我看来了解-o 、-f就可以操作了
好!找个文件,尝试一下
[hai@localhost~]$ cd ~/proj1/fastqc/
[hai@localhost fastqc]$ fastqc -f fastq -o ./../reads/example1.*
Started analysis of example1.L.fq
Approx 5% complete for example1.L.fq
Approx 10% complete for example1.L.fq
Approx 15% complete for example1.L.fq
Approx 20% complete for example1.L.fq
Approx 25% complete for example1.L.fq
Approx 30% complete for example1.L.fq
Approx 35% complete for example1.L.fq
Approx 40% complete for example1.L.fq
Approx 45% complete for example1.L.fq
Approx 50% complete for example1.L.fq
Approx 55% complete for example1.L.fq
Approx 60% complete for example1.L.fq
Approx 65% complete for example1.L.fq
Approx 70% complete for example1.L.fq
Approx 75% complete for example1.L.fq
Approx 80% complete for example1.L.fq
Approx 85% complete for example1.L.fq
Approx 90% complete for example1.L.fq
Approx 95% complete for example1.L.fq
Analysis complete for example1.L.fq
Started analysis of example1.R.fq
Approx 5% complete for example1.R.fq
Approx 10% complete for example1.R.fq
Approx 15% complete for example1.R.fq
Approx 20% complete for example1.R.fq
Approx 25% complete for example1.R.fq
Approx 30% complete for example1.R.fq
Approx 35% complete for example1.R.fq
Approx 40% complete for example1.R.fq
Approx 45% complete for example1.R.fq
Approx 50% complete for example1.R.fq
Approx 55% complete for example1.R.fq
Approx 60% complete for example1.R.fq
Approx 65% complete for example1.R.fq
Approx 70% complete for example1.R.fq
Approx 75% complete for example1.R.fq
Approx 80% complete for example1.R.fq
Approx 85% complete for example1.R.fq
Approx 90% complete for example1.R.fq
Approx 95% complete for example1.R.fq
Analysis complete for example1.R.fq
欧了,点开文件查看一下!
Fastqc基本就是这些,看那个红叉叉就是有问题的,有问题,下面就要解决问题了。
如果再跟做饭联系起来,fastqc的作用就相当于看看这刚拔出来的菜什么地方脏,下面就是要洗洗切切了,下一个工具是fastp,说到这个工具,我不禁感叹世界是多么美好啊。
Fastp:fastp开发为具有有用的质量控制和数据过滤功能的超快速FASTQ预处理器。只需扫描FASTQ数据,它即可执行质量控制,适配器修整,质量过滤,每次读取质量修剪和许多其他操作。该工具是用C ++开发的,具有多线程支持。根据我们的评估,fastp比其他FASTQ预处理工具(如Trimmomatic或Cutadapt)快2至5倍,尽管执行的操作要比类似工具多得多。
具体优势:
对数据自动进行全方位质控,生成人性化的报告。过滤功能(低质量,太短,太多N……)。对每一个序列的头部或尾部,计算滑动窗内的质量均值,并将均值较低的子序列进行切除(类似Trimmomatic的做法,但是快非常多)。全局剪裁 (在头/尾部,不影响去重),对于Illumina下机数据往往最后一到两个cycle需要这样处理。去除接头污染。厉害的是,你不用输入接头序列,因为算法会自动识别接头序列并进行剪裁。对于双端测序(PE)的数据,软件会自动查找每一对read的重叠区域,并对该重叠区域中不匹配的碱基对进行校正。去除尾部的polyG。对于Illumina
NextSeq/NovaSeq的测序数据,因为是两色法发光,polyG是常有的事,所以该特性对该两类测序平台默认打开。对于PE数据中的overlap区间中不一致的碱基对,依据质量值进行校正可以对带分子标签(UMI)的数据进行预处理,不管UMI在插入片段还是在index上,都可以轻松处理。可以将输出进行分拆,而且支持两种模式,分别是指定分拆的个数,或者分拆后每个文件的行数。
好了,说了这么多开盘吧!
参数:
-i, --in1 R1文件输入;
-I, --in2 R2文件输入;
-o, --out1 R1文件处理后的输出;
-O, --out2 R2文件处理后的输出;
-h, --html 设置输出html格式的质控结果文件名,不设置则默认html文件名为fastp.html
-j, --json 设置输出html格式的质控结果文件名,不设置则默认json文件名为fastp.json
其实它还有一些参数,这个根据需要自己查一下稍微改一下就好了,对于小白白来说,一切都是最好的安排,就一切都默认吧!
SE:
fastp -i in.fq -o out.fq
PE:
fastp -i in.R1.fq -o out.R1.fq -I in.R2.fq -O out.R2.fq
#注意大小写
fastp对于输入和输出都支持gzip压缩,只要文件名的末尾带有.gz,就会被认为是gzip压缩文件,会启用gzip对输入输出进行压缩和解压处理。
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
除此之外,fastp还会给出一个人性化的报告。为啥人性化自己看吧。
经过处理后,再用fastqc看一下是否都绿了.
Fstqc –f fastq –o ./ ../data/reads/example.*
哈哈哈,全绿了,红叉叉也没有了。
Ok,下一步,序列比对!