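Transcriptome Basics (3): Understanding FASTQ Sequencing Data

Preface

Use the installed sratoolkit to convert the sra files into FASTQ sequencing files, then use fastqc to check the quality of those files. Homework: understand sequencing reads, GC content, quality values, adapters, indexes, and every section of the fastqc report; look up Chinese tutorials; and post your notes on the forum.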

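Data processing

The huge volumes of data produced by high-throughput sequencing are compressed before they are uploaded, and compression schemes better than sra are still being researched. The first step is to convert the sra files into the human-readable FASTQ format: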

cd /mnt/e/0ngs    # directory holding the data
for id in *.sra; do fastq-dump --gzip --split-3 "$id"; done

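fastq-dump usage

--gzip: write gzip-compressed output
--split-3: use for paired-end (PE) reads; the two mates are written to *_1.fastq.gz and *_2.fastq.gz, and reads without a mate go to a separate *.fastq.gz file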

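Introduction to FASTQ files

Start by looking at the first few lines of the FASTQ data to get a rough idea of its contents. Because this is PE sequencing, look at both files: zcat SRR3589959_1.fastq.gz |head -n 8 and zcat SRR3589959_2.fastq.gz |head -n 8.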

[Figure 1503569536378.png: output of the two commands above]

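As the output shows, each read in a FASTQ file is a record of 4 lines:

Line 1: the sequence identifier and any related description, beginning with '@'.
Line 2: the sequence itself.
Line 3: begins with '+', optionally followed by a repeat of the line 1 content.
Line 4: the quality of each base in line 2, encoded as one ASCII character per base (Sanger / Illumina 1.9 corresponds to Phred+33, i.e. quality = ASCII code - 33).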
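To make the four-line structure concrete, here is a minimal sketch that walks through the records and reports the length, GC content, and mean quality of each read. It assumes uncompressed, unwrapped four-line records on standard input and Phred+33 encoding; fastq_stats is a hypothetical helper name for this illustration, not part of sratoolkit or FastQC.

```shell
# fastq_stats (hypothetical helper): read FASTQ on stdin and print, per read,
# identifier, read length, GC percentage and mean Phred quality (tab-separated).
# Assumes strict 4-line records and Phred+33 quality encoding.
fastq_stats() {
  awk '
    BEGIN {
      # Lookup table: printable ASCII character -> code point.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 1 {
      if (substr($0, 1, 1) != "@") {
        print "line " NR ": identifier line must start with @" > "/dev/stderr"
        exit 1
      }
      id = substr($1, 2)          # identifier without the leading @
    }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 {
      if (substr($0, 1, 1) != "+") {
        print "line " NR ": separator line must start with +" > "/dev/stderr"
        exit 1
      }
    }
    NR % 4 == 0 {
      qual = $0
      n = length(seq)
      if (length(qual) != n) {
        print "line " NR ": sequence and quality lengths differ" > "/dev/stderr"
        exit 1
      }
      if (n == 0) next            # skip empty reads to avoid dividing by zero
      gc = 0; qsum = 0
      for (i = 1; i <= n; i++) {
        base = toupper(substr(seq, i, 1))
        if (base == "G" || base == "C") gc++
        qsum += ord[substr(qual, i, 1)] - 33   # Phred+33 decoding
      }
      printf "%s\t%d\t%.1f\t%.1f\n", id, n, 100 * gc / n, qsum / n
    }
  '
}
```

A possible way to try it on the first 1000 reads of one of the files above: zcat SRR3589959_1.fastq.gz | head -n 4000 | fastq_stats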

PS: about the identifier on the first line

[Figure 1503571678449.png: the sequence identifier line]

Illumina sequence identifiers before Casava v1.8:

@HWUSI-EAS100R:6:73:941:1973#0/1

where

HWUSI-EAS100R the unique instrument name

6 flowcell lane

73 tile number within the flowcell lane

941 ‘x’-coordinate of the cluster within the tile

1973 ‘y’-coordinate of the cluster within the tile

#0 index number for a multiplexed sample (0 for no indexing)

/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
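The field layout above can be sketched as a small parser. This is only an illustration under the assumption that the identifier follows exactly the pattern shown (fields separated by ':', '#' and '/'); parse_illumina_old is a hypothetical function name.

```shell
# parse_illumina_old (hypothetical helper): split an old-style Illumina
# identifier, passed as the first argument, into named fields.
parse_illumina_old() {
  printf '%s\n' "$1" | awk -F'[:#/]' '{
    sub(/^@/, "", $1)   # drop the leading @ from the instrument name
    printf "instrument=%s lane=%s tile=%s x=%s y=%s index=%s member=%s\n",
      $1, $2, $3, $4, $5, $6, $7
  }'
}
```

For example, parse_illumina_old '@HWUSI-EAS100R:6:73:941:1973#0/1' labels each part of the identifier shown above.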

Illumina sequence identifiers from Casava v1.8 onward:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

ID	Description
EAS139	the unique instrument name
136	the run id
FC706VJ	the flowcell id
2	flowcell lane
2104	tile number within the flowcell lane
15343	‘x’-coordinate of the cluster within the tile
197393	‘y’-coordinate of the cluster within the tile
1	the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y	Y if the read fails filter (read is bad), N otherwise
18	0 when none of the control bits are on, otherwise it is an even number
ATCACG	index sequence
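As with the older format, the table can be turned into a short parsing sketch. It assumes the identifier has exactly the eleven fields listed above, separated by ':' and a single space; parse_illumina_new is a hypothetical function name.

```shell
# parse_illumina_new (hypothetical helper): split a Casava 1.8+ style
# identifier, passed as the first argument, into named fields.
parse_illumina_new() {
  printf '%s\n' "$1" | awk -F'[: ]' '{
    sub(/^@/, "", $1)   # drop the leading @ from the instrument name
    printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s member=%s filtered=%s control=%s index=%s\n",
      $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11
  }'
}
```

For example, parse_illumina_new '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG' labels each part of the identifier shown above.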

NCBI Sequence Read Archive

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

Sequence quality control

fastqc -t 6 *.fastq.gz    # -t 6: process up to 6 files at the same time

The results look like this:

[Figure 1503577769616.png: FastQC report summary]
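Besides the HTML report, FastQC also writes a tab-separated summary.txt (status, module name, file name) into each result archive, which is handy when there are many files to check. The sketch below tallies those statuses from standard input; fastqc_tally is a hypothetical helper name, and the archive path used in the usage note is an assumption about the usual output layout.

```shell
# fastqc_tally (hypothetical helper): read FastQC summary.txt content on stdin,
# list every module that did not pass, then print the PASS/WARN/FAIL counts.
fastqc_tally() {
  awk -F'\t' '
    { count[$1]++ }
    $1 != "PASS" { print $1 ": " $2 }
    END { printf "PASS=%d WARN=%d FAIL=%d\n", count["PASS"], count["WARN"], count["FAIL"] }
  '
}
```

Assuming the archive is named SRR3589959_1_fastqc.zip and contains SRR3589959_1_fastqc/summary.txt, one way to use it would be: unzip -p SRR3589959_1_fastqc.zip SRR3589959_1_fastqc/summary.txt | fastqc_tally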

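Green means a check passed, yellow is a warning, and red means the check failed. In the figure, Per base sequence content fails because the base distribution over the first 15 positions is abnormal, which may point to sequence contamination or adapters that were not fully removed. In general the per-position base composition of mRNA sequencing data is fairly uniform, with the four lines running roughly parallel, whereas ChIP-seq and RIP-seq data may show a much stronger composition bias. Note, however, that RNA-seq libraries primed with random hexamers commonly show biased composition over roughly the first 10 to 15 bases, so this failure on its own does not necessarily indicate contamination.
The last three checks in the report can be used to investigate further whether contamination or leftover adapter sequence is present.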
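To see what the Per base sequence content check is measuring, the per-position base percentages can be recomputed by hand. This is a rough sketch, not FastQC's own implementation: it assumes uncompressed four-line FASTQ records on standard input, and base_composition is a hypothetical helper name.

```shell
# base_composition (hypothetical helper): read FASTQ on stdin and print the
# percentage of A, C, G and T at each of the first N positions (default 15).
# Other symbols such as N count toward the total, so a row may sum to < 100.
base_composition() {
  awk -v n="${1:-15}" '
    NR % 4 == 2 {
      len = length($0)
      if (len > n) len = n
      for (i = 1; i <= len; i++) {
        count[i, toupper(substr($0, i, 1))]++
        total[i]++
      }
      if (len > maxlen) maxlen = len
    }
    END {
      print "pos\tA\tC\tG\tT"
      for (i = 1; i <= maxlen; i++)
        printf "%d\t%.1f\t%.1f\t%.1f\t%.1f\n", i,
          100 * count[i, "A"] / total[i], 100 * count[i, "C"] / total[i],
          100 * count[i, "G"] / total[i], 100 * count[i, "T"] / total[i]
    }
  '
}
```

For example: zcat SRR3589959_1.fastq.gz | head -n 400000 | base_composition 15. If the four columns differ strongly in the first rows and then settle down, that matches the pattern described above.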

[Figure 1503647245130.png: the last three checks of the FastQC report]