关键词：BSA分析，测序数据质控，FastQC，Trimmomatic，去接头序列

数据格式

一定要先搞清楚数据的格式，才能进行进一步的数据整理。
我们拿到测序数据基本是fastq格式的文件，现在一般都是150 PE（150bp, 双端）的reads，数据文件通常也会命名成_1.fq.gz和_2.fq.gz，进行区分（注意双端数据一个样本对应两个数据文件）。也有可能是从公共数据库下载的数据，比如NCBI的SRA格式的数据，可以使用sratoolkit这类软件将格式转换成fastq。
fastq的格式如下：
第一行：描述行（Header Line）。以“@”字符开头，属于数据的标签，包含序列的名称、标识符等测序样本的信息，双端数据还会有类似于"/1"和"/2"（也有“：2“和”：1“这种，具体看数据）这种分类标识。
第二行：序列行（Sequence Data Line），留意序列长度，如果是150 PE的序列，小于150bp的数据我们通常会过滤掉；如果序列长度大于150bp，一般是因为序列的接头未去除。
第三行：分隔行（Separator Line）：以“+”开始，可以储存一些附加信息，一般是空的
第四行：质量分数行（Quality Score Line）：包含与序列数据行中的对应碱基的质量分数。质量分数通常以ASCII字符表示，表示测序仪器对每个碱基的测序质量。较高的ASCII值表示更高的质量，我们一般会去统计Q30碱基的比例，以此作为测序数据的质量好坏的衡量标准之一。
质量值对应表，计算方式可自行百度：

质量值对应表

数据示例：

#sample1_1.fq.gz
@E150015079L1C001R0020000323/1
CATACTGAGTATGGTGATTCAGGATTATTCCTGACGGATGTATCGAATGTTGTGGTGAAAGTGTACCACCTCTGCAGAGTGCCAAACTACTCGAATAACCGCCTCCGTAGTTATGGACAGTTGGAGTGGCCATACTATTCTATCGCCAGA
+
@CCCCDDCCCDCC@CCCCDCCCCBCCBBCCCCCBDCCCCCCCCDCB?CCCCCCDCCCCCCCBBC@CDCCDDCCCCCDA<CDCCCCACCCDDDCCCCCCCDCCDDDDCCCCCCCCC3CDC0CCA?CC?C9@DDCCDCCBCDDCDCCCDDCC

#sample1_2.fq.gz
@E150015079L1C001R0020000323/2
GCAAAAAGCTAGATTTATCTCTAACTGAGTTTAGAAGTAGAGAAAACTAGTTTAATGTCTTTCAACACCTAAGTGGGAACTTAGCTTTCACTTTCATAACCCTGTTGTGATCAAGTCAAATTCATGATTAATTTCACAAGTCATTTTGAA
+
CCCC5?CCCCCCCCCCCDCCCCCBCCC@CCCBCCCCCCCBCCCBCCCCCCACCCCCCCCCCCCCACCDCCC5CCCBC?CCCCC>CCCCCBDCCCCCCBACCCCBCCBCC/CA@@CCCC=CBCCBCCCBC;?CCCDCCACC=DCCCCCCCC

数据质控

FastQC下载安装

如果了解数据格式之后仍旧不确定数据是否去除接头序列以及数据质量好坏，我们可以借助软件的力量，这里就不得不提fq质控常见软件FastQC了。

FastQC软件Windows和Linux两个系统的适用版本都有

FastQC软件下载链接

1、Windos系统

下载后解压缩-->双击"run_fastqc.bat"运行程序--> 点击"File">点击"Open">选择要分析的序列文件-->FastQC的分析界面展示结果-->点击"File">点击"Save report"保存报告

帮助说明文档：点击“Help”>“Contents”

2、Linux系统

还是推荐用Linux，毕竟不受电脑cpu和样本数据量的限制

#确认是否安装好了Java
java -version
#没有就安
sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless
sudo apt install openjdk-9-jre-headless


#安装好之后,解压fastqc,更改权限，执行看是否安装成功
unzip ~/softwarepath/fastqc_v0.12.1.zip -d ~/softwarepath/ 
chmod 777 ~/softwarepath/FastQC/fastqc
~/softwarepath/FastQC/fastqc -h


#懒人方法(推荐)
conda create -n fastqc 
conda activate fastqc 
conda install -c bioconda fastqc -y

#安装好后，写入环境配置文件中
echo 'export PATH=~/softwarepath/FastQC:$PATH' >>~/.bashrc
source ~/.bashrc

FastQC使用方法

fastqc --help
            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    data.

    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.

    The options for the program as as follows:

    -h --help       Print this help file and exit

    -v --version    Print the version of the program and exit

    -o --outdir     Create all output files in the specified output directory.
                    Please note that this directory must exist as the program
                    will not create it.  If this option is not set then the
                    output file for each sequence file is created in the same
                    directory as the sequence file which was processed.

    --casava        Files come from raw casava output. Files in the same sample
                    group (differing only by the group number) will be analysed
                    as a set rather than individually. Sequences with the filter
                    flag set in the header will be excluded from the analysis.
                    Files must have the same names given to them by casava
                    (including being gzipped and ending with .gz) otherwise they
                    won't be grouped together correctly.

    --nano          Files come from nanopore sequences and are in fast5 format. In
                    this mode you can pass in directories to process and the program
                    will take in all fast5 files within those directories and produce
                    a single output file from the sequences found in all files.

    --nofilter      If running with --casava then don't remove read flagged by
                    casava as poor quality when performing the QC analysis.

    --extract       If set then the zipped output file will be uncompressed in
                    the same directory after it has been created. If --delete is
                    also specified then the zip file will be removed after the
                    contents are unzipped.

    -j --java       Provides the full path to the java binary you want to use to
                    launch fastqc. If not supplied then java is assumed to be in
                    your path.

    --noextract     Do not uncompress the output file after creating it.  You
                    should set this option if you do not wish to uncompress
                    the output when running in non-interactive mode.

    --nogroup       Disable grouping of bases for reads >50bp. All reports will
                    show data for every base in the read.  WARNING: Using this
                    option will cause fastqc to crash and burn if you use it on
                    really long reads, and your plots may end up a ridiculous size.
                    You have been warned!

    --min_length    Sets an artificial lower limit on the length of the sequence
                    to be shown in the report.  As long as you set this to a value
                    greater or equal to your longest read length then this will be
                    the sequence length used to create your read groups.  This can
                    be useful for making directly comaparable statistics from
                    datasets with somewhat variable read lengths.

    --dup_length    Sets a length to which the sequences will be truncated when
                    defining them to be duplicates, affecting the duplication and
                    overrepresented sequences plot.  This can be useful if you have
                    long reads with higher levels of miscalls, or contamination with
                    adapter dimers containing UMI sequences.


    -f --format     Bypasses the normal sequence file format detection and
                    forces the program to use the specified format.  Valid
                    formats are bam,sam,bam_mapped,sam_mapped and fastq


    --memory        Sets the base amount of memory, in Megabytes, used to process
                    each file.  Defaults to 512MB.  You may need to increase this if
                    you have a file with very long sequences in it.

    --svg           Save the graphs in the report in SVG format.

    -t --threads    Specifies the number of files which can be processed
                    simultaneously.  Each thread will be allocated 250MB of
                    memory so you shouldn't run more threads than your
                    available memory will cope with, and not more than
                    6 threads on a 32 bit machine

    -c              Specifies a non-default file which contains the list of
    --contaminants  contaminants to screen overrepresented sequences against.
                    The file must contain sets of named contaminants in the
                    form name[tab]sequence.  Lines prefixed with a hash will
                    be ignored.

    -a              Specifies a non-default file which contains the list of
    --adapters      adapter sequences which will be explicity searched against
                    the library. The file must contain sets of named adapters
                    in the form name[tab]sequence.  Lines prefixed with a hash
                    will be ignored.

    -l              Specifies a non-default file which contains a set of criteria
    --limits        which will be used to determine the warn/error limits for the
                    various modules.  This file can also be used to selectively
                    remove some modules from the output all together.  The format
                    needs to mirror the default limits.txt file found in the
                    Configuration folder.

   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.

   -q --quiet       Suppress all progress messages on stdout and only report errors.

   -d --dir         Selects a directory to be used for temporary files written when
                    generating report images. Defaults to system temp directory if
                    not specified.

主要参数解读：
-o 或 --outdir #FastQC生成的报告文件的输出路径
--extract #生成的报告不打包成压缩文件，默认打包
-t 或 --threads #选择程序运行的线程数
-q 或 --quiet #安静运行模式，一般不选

#生成结果文件名与输入有关。
fastqc path/to/sample1_1.fq.gz path/to/sample1_2.fq.gz … -o 文件夹
#使用fastqc进行质量检测，会当前文件夹生成一个.html网页文件和一个.zip文件

FastQC结果解读

需要重点关注的结果
● Basic Statistics: 数据量概览
● Per base sequence quality：reads单碱基测序质量的直接展示。
● Per sequence quality scores：总体reads测序质量趋势。
● Per base sequence content： ATGC含量估计测序是否存在偏差。
● Sequence Duplication Levels：数据重复情况。
影响测序的因素很多，要排除数据污染，冗余、数据量不足等问题，因此前期数据处理时一定要进行质控。
使用浏览器打开后缀是html的文件查看网页版结果，就是图表化的fastqc报告。

报告目录

整个报告分成11个部分。绿色的√表示合格；黄色的！表示警告；红色的×表示不合格。

基本数据统计

Encoding：表示测序平台版本及对应编码版本号。
Total Sequences：reads总数。
Sequence length：测序长度，范围值。
%GC：表示序列中的GC含量，具有物种特异性，相同物种的GC较为接近。GC含量远远偏离该物种常见值，说明测序数据存在一定偏好性，如果直接使用数据会影响后续变异检测的分析。

序列单碱基的测序质量（本次实验数据碱基质量挺高，因此找了下面的网图）

序列单碱基的测序质量

横轴代表碱基位置；纵轴代表碱基质量值，红色线表示该位置质量值中位数，黄色是25%-75%区间，触须是10%-90%区间，蓝线是质量值平均数。(一般要求所有位置的1/10分位数大于20，否则去除质量值在20以下的碱基，即Q20过滤。Warning，如果某位置碱基质量低于10,或者是某位置碱基质量值中位数低于25;Failure，如果某位置碱基质量低于5,或者是某位置碱基质量值中位数低于20。)

单个tile测序的测序质量

tile：每一次测序荧光扫描的最小单位。横轴代表144个碱基的位置；纵轴是tile的Index编号。(检查reads中每一个碱基位置在不同的测序小孔之间的偏离度，蓝色表示低于平均偏离度，偏离度小，质量好；红色表示偏离平均质量越多，质量也越差。)

序列质量得分的分布情况

reads的质量值是指每个位置Q值的平均值，该图横轴表示Q值，范围0-40，；纵轴是某一Q值对应的reads数。如果测序结果主要集中在高Q值区，证明测序质量较好！（当测序质量峰值小于27（错误率0.2%）时报"WARN"；当峰值小于20（错误率1%）时报"FAIL"。）

ATCG四种碱基在序列上的分布

横轴是1-150 bp；纵轴是百分比。图中四条线代表A T C G在每个位置平均含量。（理论认为，%A=%T，%G=%C，由于测序仪初始状态状态不稳定，可能会出现上图左端的情况。如果存在这种情况，需要去除开始部分序列信息,一般10bp左右即可，具体参考横坐标轴。）

序列平均GC含量分布图

横轴是平均GC值，为1%-100%；纵轴是每条序列GC含量对应的reads数量（蓝线是经验分布理论值，红线是真实值，二者越接近越好。如果出现多峰红线，一般是序列存在污染）

序列每个位置N的比例

reads每个位置N（未知碱基）的比例。当出现峰时，说明测序存在问题。(当任意位置的N的比例超过5%，报"WARN"；当任意位置的N的比例超过20%，报"FAIL"。)

序列测序长度分布

测序长度一般应该相等，允许些许序列存在几bp的偏差存在，这种序列数量少便不影响后续分析。（当测序长度参差不齐，则表明测序数据有误）

统计reads重复水平

横坐标是重复的次数，纵坐标是重复序列占特异序列总数百分比。（和PCR过程类似，测序会产生重复reads,测序深度越高,reads重复数越大；如果重复出现峰值，就提示可能存在偏差。fastqc一般抽取前200,000条reads进行重复数统计。当重复数目≥10时，reads会被合并统计，这也是最右侧起峰的原因；而＞75bp的reads一般只取50bp进行比较。由于reads越长错误率越高，所以重复率统计可能偏低。当非特异的reads占总数的比例大于20%时，报"WARN"；当非特异的reads占总数的比例大于50%时，报"FAIL“）

过度重复出现的序列（很少出现）

接头含量

如果在当时fastqc分析的时候-a选项没有内容，则默认使用图例中的四种通用adapter序列进行统计，本例中接头（adapter）已经去除。如果有adapter序列没有去除干净，会在图中左右端起峰。在后续分析可以使用Trimmomatic等软件去接头。

其他质控软件

还有MultiQC、fastp等常见质控的软件，这里不再一一详细介绍，有需要可留言。

MultiQC：可以整合多个样本的质控结果信息，但相对的信息没有FastQC那么详细、

fastp:除了可以完成质控，还可以进行接头序列去除等操作。

Trimmomatic下载安装

Trimmomatic下载链接

#下载并解压
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.40.zip && unzip Trimmomatic-0.40.zip

#解压后的文件夹
Trimmomatic-0.39
├── adapters              #该文件夹下是ILLUMINA常规的接头文件，有单端和双端。
│ ├── NexteraPE-PE.fa
│ ├── TruSeq2-PE.fa
│ ├── TruSeq2-SE.fa
│ ├── TruSeq3-PE-2.fa
│ ├── TruSeq3-PE.fa
│ └── TruSeq3-SE.fa
├── LICENSE
└── trimmomatic-0.39.jar  #编译之后的文件，使用java执行

1 directory, 8 files


#懒人安装
conda create -n trimmomatic
conda activate trimmomatic
conda install trimmomatic

#也可以使用git clone 从Github上获得0.40的版本

Trimmomatic 使用方法

java -jar ~/softwarepath/trimmomatic-0.39.jar -h
Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version

#参数注释
PE/SE             设定对Paired-End或Single-End的reads进行处理，其输入和输出参数稍有不一样。
-threads          设置多线程运行数
-phred33          设置碱基的质量格式，可选pred64，不设置这个参数，软件会自动判断输入文件是哪种格式
-trimlog          指定过滤日志文件名，日志中包含以下四列内容:read ID、过滤之后剩余序列长度、过滤之后的序列起始碱基位置（序列开头处被切掉了多少个碱基）、过滤之后的序列末端碱基位置、序列末端处被剪切掉的碱基数。
                  #由于生成的trimlog文件中包含了每一条 reads 的处理记录，因此文件体积巨大（GB级别），如果后面不会用到 trim日志，建议不要使用这个参数
-basein           通常 PE 测序的两个文件，指定其中 R1 文件名，软件会推测出 R2 的文件名，但是这个功能实测并不好用，建议不用-basein参数，直接指定两个文件名（R1 和 R2）作为输入
-baseout          输出文件有四个，使用 -baseout 参数指定输出文件的 basename，软件会自动为四个输出文件命名，过滤之后双端序列都保留的就是 paired，反之如果其中一端序列过滤之后被丢弃了另一端序列保留下来了就是 unpaired(即 成对的clean reads, 未成对的正向序列以及未成对的反向序列)
                  #一般情况下，若paired reads百分比占90%以上，可只对paired reads进行比对分析
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true
                  切除adapter序列。参数后面分别接adapter序列的fasta文件：第一步 seed 搜索时允许的最大错配碱基个数2：palindrome模式下匹配碱基数阈值30：simple模式下的匹配碱基数阈值10(7-15之间）：palindrome 模式允许切除的最短接头序列为 8bp（默认值）：palindrome 模式去除与 R1 完全反向互补的 R2（默认去除false），但在有些情况下，例如需要用到 paired reads 的 bowtie2 流程，就要将这个参数改为 true，否则会损失一部分 paired reads。
                  #按照规定顺序，ILLUMINACLIP 各个参数之间以冒号分开，PE测序需要注意最后一个参数。对于SE测序最后两个参数可以不设置
LEADING:3         切除首端碱基质量小于3的碱基
                  #Illumina平台有些低质量的碱基质量值被标记为 2 ，所以设置为3可以过滤掉这部分低质量碱基。
TRAILING:3        切除尾端碱基质量小于3的碱基
SLIDINGWINDOW:15:20
                  滑窗质量过滤，一般一个read的低质量序列都是集中在末端，也有很少部分在开头。从5'端开始进行滑动，当滑动位点周围一段序列(window)的平均碱基低于阈值，则从该处进行切除。Windows的size是15个碱基（一般设置在10-30之间），其平均碱基质量小于20，则切除
MINLEN:50         可被保留的最短reads长度，应根据原始序列的长度而定
HEADCROP:<length> 在reads的首端切除指定的长度
CROP:<length>     保留reads到指定的长度
TOPHRED33         将碱基质量转换为pred33格式
TOPHRED64         将碱基质量转换为pred64格式

1、SE模式（单端数据模式）

指定输入文件和输出文件（均一个）。

java -jar ~/softwarepath/trimmomatic-0.40.jar SE -threads 线程数 -trimlog logFile（任务日志文件） <input> <output> <step 1> ...

（！不建议开）指定trimlog参数会创建一个记录所有reads 的日志，内容包括:

read的名字，即QNAME
保留序列长度
从起始位置，第一个保留碱基的位置。
原始read中最后一个碱基的位置
从末尾切除掉的碱基个数

可以根据需要指定多个步骤，所需处理(裁剪、接头裁剪等步骤)作为附加参数放在指定在输入/输出文件之后均可。

2、PE模式（双端数据模式）

对于Pair-End数据，需要两个输入文件，会输出4个文件。其中两个文件对应于"paired"数据，即read1和read2都保留。另外两个对应于"unpaired"数据，在处理的过程中会过滤掉其中一端的reads。

java -jar ~/softwarepath/trimmomatic-0.40.jar PE -threads 线程数 -trimlog logFile（任务日志文件）\
 [-basein <inputBase> | <input 1> <input 2>] [-baseout <outputBase> | <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...

## 示例
java -jar ~/softwarepath/trimmomatic-0.39.jar PE \
    # 输入文件
    sample_1.fq.gz sample_2.fq.gz \
    # 输出文件
  Trim_paired_1.fq.gz Trim_unpaired_1.fq.gz \
    Trim_paired_2.fq.gz Trim_unpaired_2.fq.gz \
    # 去接头参数
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
    # 切除TruSeq3-PE中提供的Illumina adapter。最初Trimmomatic将寻找seed匹配(16个碱基)，最多允许2个不匹配。如果在配对端读取的情况下达到30分(约50个碱基，50*0.6)，或在单端读取的情况下达到10分(约17个碱基, 17*0.6)，这些"seed"序列将被修剪。
    LEADING:3 \
    # 删除前低质量碱基(低于质量3)
    TRAILING:3 \
    # 删除后低质量碱基(低于质量3)
    SLIDINGWINDOW:4:15 \
    # 扫描4个碱基宽滑动窗口，当每个碱基的平均质量下降到15以下时进行剪切
    MINLEN:80
    # 上述步骤完成后，删除小于80个碱基的reads