hbctraining-Introduction to ChIP-Seq Lesson 2

Quality Control of Sequence Reads

Understanding the Illumina sequencing technology

Unmapped read data (FASTQ)

The FASTQ format evolved from FASTA: in addition to the sequence itself, it stores a quality score for every base. There are two main quality encodings, Phred+33 and Phred+64, which differ in the offset added to the score before it is mapped to a character in the ASCII table. Figure 3 shows the mapping between quality characters and scores for Phred+33.

Figure 1. FASTQ format
Figure 2. Quality scoring systems
Figure 3. Phred+33
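To make the encoding concrete, here is a minimal sketch of how a Phred+33 quality string can be decoded by hand. The four-line record and the function name `decode_phred33` are made up for illustration; they do not come from the lesson.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string into a list of integer scores.

    Each character's ASCII code minus the offset (33) gives the score.
    """
    return [ord(ch) - 33 for ch in quality_string]


# A made-up four-line FASTQ record (illustrative only):
# header, sequence, separator, quality string.
record = [
    "@read_1",
    "GATTACA",
    "+",
    "IIIH#5!",
]

header, sequence, separator, quality = record
scores = decode_phred33(quality)

# Print each base next to its quality character and decoded score.
for base, char, score in zip(sequence, quality, scores):
    print(base, char, score)
```

For Phred+64 data the same idea applies with an offset of 64 instead of 33.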

Each quality score represents the probability that the corresponding nucleotide call is incorrect. The score is on a logarithmic scale and is calculated as:

$Q = -10 \times \log_{10}(P)$, where $P$ is the probability that a base call is erroneous.
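As a quick worked example of this relationship, the sketch below converts between quality scores and error probabilities. The helper names `error_probability` and `phred_score` are my own, chosen for illustration.

```python
import math


def error_probability(q):
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred score Q for a given error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

So a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.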

Assessing quality with FastQC

FastQC is a widely used quality control tool. Its main functions are:

* Import of data from BAM, SAM or Fastq files (any variant)

* Providing a quick overview to tell you in which areas there may be problems

* Summary graphs and tables to quickly assess your data

* Export of results to an HTML based permanent report

* Offline operation to allow automated generation of reports without running the interactive application

Of all the FastQC results, the "Per base sequence quality" plot is the most important analysis module for ChIP-Seq. It shows the distribution of quality scores across all reads at each position in the read. This information can help determine whether anything went wrong at the sequencing facility during sequencing. Generally, we expect quality to decrease towards the end of the reads, but we should not see any quality drops at the beginning or in the middle of the reads.

a good quality sample
a not-so-good quality sample
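To show what the per base sequence quality plot summarizes, here is a rough sketch that groups Phred+33 scores by read position and reports the mean and median at each position. It is not how FastQC is implemented; the function `per_position_scores` and the three quality strings are invented for illustration.

```python
from statistics import mean, median


def per_position_scores(quality_strings, offset=33):
    """Group decoded quality scores by position in the read.

    Returns a list where element i holds the scores observed at
    position i across all reads (reads may differ in length).
    """
    columns = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    return columns


# Made-up quality strings whose scores drop towards the 3' end.
reads = ["IIIIH5", "IIIH5#", "IIII5+"]

for pos, scores in enumerate(per_position_scores(reads), start=1):
    print(pos, round(mean(scores), 1), median(scores))
```

In this toy input the scores stay high for the first positions and fall off at the last two, which is the expected pattern; a drop at the first positions or in the middle would be the warning sign described above.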

PS: In my opinion, the "Adapter Content" module is important as well.
