FASTQ files: reposted background notes

I am a complete novice in biology, so what is common sense to professionals can seem like a sealed book to me. I therefore need to write down the basics for myself.

Again, this information is copied directly from the usearch website.


FASTQ files are text files containing sequence data with a quality (Phred) score for each base, represented as an ASCII character. The quality score is an integer (Q) which is typically in the range 2 - 40, but higher and lower values are sometimes used. In particular, versions 1.8 and later of the Illumina platform generate reads with Q scores up to 41.

Unfortunately, the FASTQ format is not standardized. There are several variants in common use, and it is not possible to distinguish them automatically with high reliability. The fastq_chars command can be used to guess the format of an unknown file. See FASTQ format options.


FASTQ file


This is what my data look like:

FASTQ read with 50 base calls in Illumina format (ASCII_BASE=33).

There are always four lines per read. The first line starts with '@', followed by the label. The second line contains the base calls (the sequence).

The third line starts with '+'. In some variants, the '+' line contains a second copy of the label. The fourth line contains the Q scores represented as ASCII characters.
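The four-line layout described above can be sketched as a small Python reader. This is only a minimal illustration, not how usearch parses files: the names `FastqRecord` and `read_fastq` and the example read are made up for this note, and the sketch assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: label (without '@'), base calls, and quality string."""
    label: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per read."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("first line of a read must start with '@'")
        if not plus.startswith("+"):
            raise ValueError("third line of a read must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("need exactly one quality character per base")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example; the second read repeats the label after '+'.
example = "@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+read2\nFFFF\n"
records = list(read_fastq(io.StringIO(example)))
for rec in records:
    print(rec.label, rec.sequence, rec.quality)
```

For a real file, the same function could be called on `open("reads.fastq")`, where the file name is a placeholder.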


FASTQ format parameters


FASTQ formats

The main parameters of a FASTQ variant are the minimum and maximum Q scores and the ASCII_BASE constant.

The fastq_chars command can be used to guess the format of a FASTQ file.
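To get a feel for why guessing is unreliable, here is a rough Python heuristic. It is not the algorithm used by fastq_chars, only a sketch under my own assumptions: the function name `guess_ascii_base` is made up, only the bases 33 and 64 are considered, and Q is assumed to lie between `min_q` and `max_q` (defaults 0 and 41). A base is kept as a candidate when every quality character would decode to a Q inside that range.

```python
from typing import Iterable, Optional


def guess_ascii_base(quality_lines: Iterable[str],
                     min_q: int = 0, max_q: int = 41) -> Optional[int]:
    """Return 33 or 64 if exactly one base fits, else None (ambiguous)."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    candidates = [
        base for base in (33, 64)
        if lowest >= base + min_q and highest <= base + max_q
    ]
    # Exactly one surviving candidate is a confident guess; zero or two is not.
    return candidates[0] if len(candidates) == 1 else None


print(guess_ascii_base(["IIII#"]))       # '#' is too low for base 64
print(guess_ascii_base(["hhhh"]))        # 'h' is too high for base 33
print(guess_ascii_base(["ABCDEFGHIJ"]))  # fits both bases, so ambiguous
```

The third call shows the problem: characters from 'A' to 'J' are valid under either base, so a file containing only those characters cannot be classified from the characters alone.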

ASCII coding of Q scores

The Q value is coded as a printable ASCII character using Q = ASCII_CODE - ASCII_BASE. Here, ASCII_CODE is the ASCII code for the character and ASCII_BASE is a constant. The original Sanger FASTQ format used ASCII_BASE = 33, so for example if the quality score is coded as 'C' then Q = ASCII_CODE('C') - 33 = 67 - 33 = 34. The usearch documentation has tables mapping ASCII characters to Q scores for common variants of FASTQ.
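The formula Q = ASCII_CODE - ASCII_BASE translates directly into Python with `ord` and `chr`. The helper names `decode_quality` and `encode_quality` are my own, and the default of 33 is just the Sanger value mentioned above; pass 64 for variants that use that base.

```python
from typing import Iterable, List


def decode_quality(quality: str, ascii_base: int = 33) -> List[int]:
    """Convert a quality string to Q scores: Q = ASCII_CODE - ASCII_BASE."""
    return [ord(ch) - ascii_base for ch in quality]


def encode_quality(scores: Iterable[int], ascii_base: int = 33) -> str:
    """Inverse mapping: ASCII_CODE = Q + ASCII_BASE."""
    return "".join(chr(q + ascii_base) for q in scores)


print(decode_quality("C"))        # [34], since ord('C') is 67 and 67 - 33 = 34
print(decode_quality("IIII#"))    # [40, 40, 40, 40, 2]
print(decode_quality("h", 64))    # [40] with ASCII_BASE = 64
print(encode_quality([40, 2]))    # I#
```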




FASTA format (from Wikipedia)

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near universal standard in the field of bioinformatics.[4]

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like the R programming language, Python, Ruby, and Perl.
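As an illustration of how little code such parsing takes, here is a minimal Python sketch. It relies on the usual FASTA convention that each record begins with a description line starting with '>' and that the sequence may be wrapped over several lines. The function name `read_fasta` and the example sequences are made up for this note.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from a FASTA stream."""
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence lines are joined back together
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical example with one wrapped sequence.
example = ">seq1 example\nACGT\nACGT\n>seq2\nTTGA\n"
fasta_records = list(read_fasta(io.StringIO(example)))
for description, sequence in fasta_records:
    print(description, len(sequence))
```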

