Quality Control of Sequence Reads
Understanding the Illumina sequencing technology
Unmapped read data (FASTQ)
The FASTQ format evolved from FASTA: in addition to the sequence data, it stores a quality score for every base. There are two main quality encodings, Phred+33 and Phred+64, which differ in the offset added to the score before it is mapped to a character in the ASCII table. Figure 3 shows the mapping between quality characters and scores for Phred+33.
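To make the offset idea concrete, here is a minimal sketch of decoding a quality string. It assumes the usual four-line FASTQ record layout (header, sequence, separator, quality); the function names `parse_fastq_record` and `decode_quality` and the example read are made up for illustration and do not come from any particular library.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into (read id, sequence, quality string)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string lengths differ")
    return header[1:], sequence, quality


def decode_quality(quality_string, offset=33):
    """Turn quality characters into integer scores by subtracting the ASCII offset.

    offset=33 corresponds to Phred+33, offset=64 to Phred+64.
    """
    return [ord(ch) - offset for ch in quality_string]


# Hypothetical record, used only to show the decoding step.
record = ["@read1\n", "GATTACA\n", "+\n", "II5?#!~\n"]
read_id, seq, qual = parse_fastq_record(record)
scores = decode_quality(qual)  # one integer score per base in seq
```

The same character decodes to a different score under each encoding, which is why the offset has to be known before any quality filtering is done.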
Each quality score encodes the probability that the corresponding nucleotide call is incorrect. The score is logarithmic and is calculated as:

$$Q = -10 \log_{10} P$$

where P is the probability that a base call is erroneous. For example, Q = 30 corresponds to an error probability of 1 in 1,000.
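The relationship between score and error probability can be sketched in a few lines of Python, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability P for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred quality score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Every increase of 10 in Q lowers the error probability by a factor of ten, so Q20 means roughly 1 wrong call in 100 and Q30 roughly 1 in 1,000.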
Assessing quality with FastQC
FastQC is a widely used quality control tool. Its main functions are:
* Import of data from BAM, SAM or Fastq files (any variant)
* Providing a quick overview to tell you in which areas there may be problems
* Summary graphs and tables to quickly assess your data
* Export of results to an HTML-based permanent report
* Offline operation to allow automated generation of reports without running the interactive application
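The offline mode mentioned in the last item can be driven from a script. The sketch below assumes that the `fastqc` executable is installed and on the PATH, and uses its `--outdir` and `--threads` options; the helper names and the file names in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(input_files, output_dir, threads=1):
    """Assemble the argument list for a non-interactive FastQC run."""
    return [
        "fastqc",
        "--outdir", str(output_dir),
        "--threads", str(threads),
        *[str(path) for path in input_files],
    ]


def run_fastqc(input_files, output_dir, threads=1):
    """Create the output directory if needed, then run FastQC on the given files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = build_fastqc_command(input_files, output_dir, threads)
    # check=True raises CalledProcessError if FastQC exits with an error.
    subprocess.run(command, check=True)
```

A call such as `run_fastqc(["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "qc_reports", threads=2)` would then be expected to write one HTML report per input file into `qc_reports`.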
Among the FastQC results, the "Per base sequence quality" plot is the most important analysis module for ChIP-Seq. It shows the distribution of quality scores across all reads at each position in the read. This information can help determine whether there were any problems at the sequencing facility during sequencing. Generally, we expect quality to decrease towards the end of the reads, but we should not see quality drops at the beginning or in the middle of the reads.
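To show what a per-position quality summary involves, here is a simplified sketch that groups decoded scores by read position. It is not FastQC's implementation: it assumes Phred+33 input and reports only the mean, median and minimum per position, whereas the FastQC plot is a box-and-whisker display. The function name and the example strings are made up.

```python
from statistics import mean, median


def per_base_quality(quality_strings, offset=33):
    """Summarise quality scores at each read position across many reads.

    Reads may differ in length; a position is summarised over the reads
    that are long enough to cover it.
    """
    columns = []  # columns[i] collects the scores seen at position i
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    return [
        {"position": i + 1, "mean": mean(col), "median": median(col), "min": min(col)}
        for i, col in enumerate(columns)
    ]


# Hypothetical quality strings from three short reads.
summary = per_base_quality(["IIII5", "III5#", "II55"])
for row in summary:
    print(row)
```

A drop in the mean or median at the last positions matches the expected pattern; a drop at the first or middle positions is the kind of signal worth raising with the sequencing facility.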
PS: In my opinion, the "Adapter Content" module is important as well.