sra-toolkit-fastqc-trimmomatic

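Finally got this working.
SRA: Sequence Read Archive
Record types: STUDY, SAMPLE, EXPERIMENT, RUN
An accession starts with a three-letter prefix.
First letter (source database):
S: NCBI's SRA database
E: EBI's database
D: DDBJ's database
Second letter: R (read)
Third letter (record type):
R: run
X: experiment
S: sample
P: project/study
For example, SRR3583049 is a run stored at NCBI.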
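
The prefix rules above are mechanical enough to check in a few lines of shell. The sketch below decodes the first three letters of an accession. `describe_accession` is a hypothetical helper name, not an sra-toolkit command, and it only knows the letters listed above.

```shell
# Decode the three-letter prefix of an accession such as SRR3583049.
# describe_accession is a hypothetical helper, not part of sra-toolkit.
describe_accession() {
    prefix=$(printf '%s' "$1" | cut -c1-3)
    # First letter: source database
    case $prefix in
        S??) db="NCBI" ;;
        E??) db="EBI" ;;
        D??) db="DDBJ" ;;
        *)   echo "unknown"; return 1 ;;
    esac
    # Second letter must be R (read); third letter: record type
    case $prefix in
        ?RR) kind="run" ;;
        ?RX) kind="experiment" ;;
        ?RS) kind="sample" ;;
        ?RP) kind="project/study" ;;
        *)   echo "unknown"; return 1 ;;
    esac
    echo "$db $kind"
}

describe_accession SRR3583049   # prints: NCBI run
```

Only accessions with an R as the third letter (runs) can be passed to prefetch and fastq-dump in the commands that follow.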

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-centos_linux64.tar.gz
tar -zxvf sratoolkit.3.0.0-centos_linux64.tar.gz  # unpacking is the whole installation
./sratoolkit.3.0.0-centos_linux64/bin/prefetch -X 60G -O ./ SRR3583049  # download a single run; -X raises the maximum download size to 60G

First search NCBI for the transcriptome datasets you want and select all of the results.

Choose Send results to Run selector.

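In Run Selector, tick the runs you want, download the Accession List (SRR_Acc_List.txt), and put it in the directory you want to work in.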
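
Before handing the list to prefetch, it can help to sanity-check it. The following is a minimal sketch, assuming the list holds one run accession per line. `clean_acc_list` is a hypothetical helper name, and stripping carriage returns is only a precaution in case the file was edited on Windows.

```shell
# Keep only well-formed run accessions (SRR/ERR/DRR + digits), one per line,
# dropping blank lines, stray carriage returns, and duplicates.
# clean_acc_list is a hypothetical helper, not part of sra-toolkit.
clean_acc_list() {
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Demo on a throwaway file that mimics a messy list.
tmp=$(mktemp)
printf 'SRR3583049\r\n\nSRR7091488\nnot_an_accession\nSRR3583049\n' > "$tmp"
clean_acc_list "$tmp"
rm -f "$tmp"
```

The cleaned output can be redirected to a new file and passed to prefetch in place of the raw download. Note that `sort -u` reorders the list.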

./sratoolkit.3.0.0-centos_linux64/bin/prefetch --option-file SRR_Acc_List.txt  # download every run listed in the file
./sratoolkit.3.0.0-centos_linux64/bin/fastq-dump --split-3 --gzip SRR7091488  # convert SRA to FASTQ; gzip output saves disk space
# or without compression
./sratoolkit.3.0.0-centos_linux64/bin/fastq-dump --split-3 ./SRR3583049/SRR3583049.sra
# batch conversion
while read -r line
do
    /path/to/bin/fastq-dump --split-3 --gzip "/path/to/$line/$line.sra" -O /path/to/output
done < /path/to/SRR_Acc_List.txt
# or use parallel-fastq-dump, which is much faster
conda install -c bioconda parallel-fastq-dump
parallel-fastq-dump -t 12 --outdir /path/to/transcriptome --split-3 --gzip -s /path/to/transcriptome/SRR1283218/SRR1283218.sra -T /path/to/tmp/

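For paired-end data it is best to add --split-3: the two mates go to *_1.fastq and *_2.fastq, and reads whose mate is missing are written to a separate file.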
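
To see which case a converted run fell into, you can look at which files fastq-dump produced. This is a rough sketch, assuming the usual --split-3 naming (`ACC_1`, `ACC_2`, and plain `ACC` for unpaired reads). `split3_layout` is a hypothetical helper name.

```shell
# Report the layout of --split-3 output for one accession.
# $1 = output directory, $2 = accession, $3 = extension (default: fastq.gz)
# split3_layout is a hypothetical helper, not part of sra-toolkit.
split3_layout() {
    dir=$1
    acc=$2
    ext=${3:-fastq.gz}
    if [ -e "$dir/${acc}_1.$ext" ] && [ -e "$dir/${acc}_2.$ext" ]; then
        if [ -e "$dir/$acc.$ext" ]; then
            echo "paired+orphans"   # both mates plus a file of unpaired reads
        else
            echo "paired"
        fi
    elif [ -e "$dir/$acc.$ext" ]; then
        echo "single"               # single-end run: one file only
    else
        echo "missing"
        return 1
    fi
}

# Demo with empty placeholder files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/SRR7091488_1.fastq.gz" "$demo/SRR7091488_2.fastq.gz"
split3_layout "$demo" SRR7091488   # prints: paired
rm -rf "$demo"
```

The result tells you whether to use the single-end or the paired-end Trimmomatic command later on.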

Next, run quality control on the transcriptome reads.

# download FastQC in the background, then install it
nohup wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip &
# wait until the download has finished before unzipping
unzip fastqc_v0.11.9.zip
cd /path/to/FastQC/
ls
chmod 700 fastqc
# run QC in the background
nohup /path/to/FastQC/fastqc -o /path/to/transcriptome -t 6 /path/to/transcriptome/SRR4294733.fastq &
# full syntax
/path/to/FastQC/fastqc -o output_dir [--(no)extract] [-f fastq|bam|sam] [-c contaminant_file] seqfile1 .. seqfileN
# batch FastQC
cd /path/to/transcriptome
ls
ls *gz | xargs /path/to/FastQC/fastqc -t 10

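--extract: the report is written as a zip archive by default; with this option FastQC also unpacks it in the output directory
-t: number of threads. Each thread processes one file at a time, so more threads only speed things up when several files are given
-c: contaminants file, listing sequences that may have contaminated the library
-a: adapters file with the adapter sequences to screen for; without it FastQC checks against its built-in list of common adapters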
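
When --extract is used, each report directory contains a tab-separated summary.txt (status, module name, file name), which makes it easy to list only the modules that need attention. A minimal sketch follows. `fastqc_flagged` is a hypothetical helper name, and the demo input is synthetic, written only to show the layout.

```shell
# Print the FastQC modules flagged WARN or FAIL in a summary.txt file.
# fastqc_flagged is a hypothetical helper, not part of FastQC.
fastqc_flagged() {
    awk -F '\t' '$1 == "FAIL" || $1 == "WARN" { print $1 ": " $2 }' "$1"
}

# Demo on synthetic input (NOT real FastQC results).
tmp=$(mktemp)
printf 'PASS\tBasic Statistics\texample.fastq\n'           >  "$tmp"
printf 'FAIL\tPer base sequence content\texample.fastq\n' >> "$tmp"
printf 'WARN\tAdapter Content\texample.fastq\n'            >> "$tmp"
fastqc_flagged "$tmp"
rm -f "$tmp"
```

A failed "Per base sequence content" or "Adapter Content" module is the kind of result that informs the HEADCROP and ILLUMINACLIP settings in the trimming step.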

Next, remove adapters with Trimmomatic, which is generally used for Illumina sequencing data.

mkdir trimmomatic
cd trimmomatic
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
unzip Trimmomatic-0.38.zip
cd Trimmomatic-0.38
which java  # prints the Java path, here ~/miniconda3/bin/java
pwd
~/miniconda3/bin/java -jar /path/to/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar  # with no arguments this prints the usage
# single-end
~/miniconda3/bin/java -jar /path/to/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar SE -phred33 /path/to/transcriptome/SRR4294733.fastq /path/to/transcriptome/SRR4294733trim.fastq ILLUMINACLIP:/path/to/trimmomatic/Trimmomatic-0.38/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
# paired-end
~/miniconda3/bin/java -jar /path/to/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar PE -threads 10 -phred33 /path/to/transcriptome/SRR16122871_1.fastq.gz /path/to/transcriptome/SRR16122871_2.fastq.gz /path/to/transcriptome/SRR16122871_forward_paired.fastq.gz /path/to/transcriptome/SRR16122871_forward_unpaired.fastq.gz /path/to/transcriptome/SRR16122871_reverse_paired.fastq.gz /path/to/transcriptome/SRR16122871_reverse_unpaired.fastq.gz ILLUMINACLIP:/path/to/trimmomatic/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:8 MINLEN:36
# HEADCROP removes unreliable bases from the start of each read; decide how many from the FastQC report. Steps run in the order given, so MINLEN goes last.
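
After a paired-end run, the two *_paired outputs should hold the same number of reads, since a pair is only kept there when both mates survive. Below is a quick check, assuming standard four-line FASTQ records. `count_reads` and `pairs_match` are hypothetical helper names, and the demo files are tiny made-up reads.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed), assuming 4 lines per record.
count_reads() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { printf "%d\n", NR / 4 }'
}

# Succeed only if both files contain the same number of reads.
pairs_match() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Demo on two tiny made-up files.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmpdir/fwd.fastq"
printf '@r1\nACGT\n+\nIIII\n@r2\nCCAT\n+\nIIII\n' | gzip -c > "$tmpdir/rev.fastq.gz"
count_reads "$tmpdir/fwd.fastq"      # prints: 2
pairs_match "$tmpdir/fwd.fastq" "$tmpdir/rev.fastq.gz" && echo "pair counts match"
rm -rf "$tmpdir"
```

Comparing the count before and after trimming also shows how many reads the chosen settings discarded.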

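The quality encoding is either Phred33 or Phred64. Older Trimmomatic releases defaulted to Phred64, and since version 0.32 the encoding is detected automatically when neither flag is given. Almost all current sequencing data is Phred33, so passing -phred33 explicitly is still the safe choice.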
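
If you are unsure which encoding a file uses, the quality lines themselves usually give it away: characters from `!` to `:` can only occur in Phred33, while characters above `J` point to Phred64. The sketch below applies that heuristic to the first 1000 reads of an uncompressed FASTQ file. `guess_phred` is a hypothetical helper name, and the cut-offs are assumptions that fit typical Illumina data rather than a formal rule.

```shell
# Guess the quality encoding of an uncompressed FASTQ file (heuristic only).
# guess_phred is a hypothetical helper, not part of Trimmomatic.
guess_phred() {
    # Every 4th line of a FASTQ record is the quality string.
    quals=$(awk 'NR % 4 == 0' "$1" | head -n 1000)
    if printf '%s\n' "$quals" | LC_ALL=C grep -q '[!-:]'; then
        echo "phred33"      # ASCII 33-58 never appears in Phred64 data
    elif printf '%s\n' "$quals" | LC_ALL=C grep -q '[K-~]'; then
        echo "phred64"      # scores above 'J' are unusual for Phred33 Illumina data
    else
        echo "ambiguous"    # only characters valid in both encodings were seen
    fi
}

# Demo on a tiny made-up read.
tmp=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIII#III\n' > "$tmp"
guess_phred "$tmp"   # prints: phred33
rm -f "$tmp"
```

The "Encoding" line in the FastQC Basic Statistics module reports the same information and is the more reliable source when available.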

ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
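ILLUMINACLIP:<fastaWithAdapters>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>
Choose the adapter file that matches your data:
TruSeq2 (as used in GAII machines)
TruSeq3 (as used by HiSeq and MiSeq machines)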

SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
SLIDINGWINDOW:<windowSize>:<requiredQuality>
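The two parameters are the window size (in bases) and the average quality required within the window. 4 and 15 are the usual values, and there is no need to change them unless the data quality is really poor.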
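
To make the window idea concrete, here is a simplified illustration that works on a list of Phred scores: it slides a window along the read and reports how many bases are kept before the first window whose mean falls below the threshold. `sliding_window_cut` is a hypothetical helper name, and this is not Trimmomatic's exact implementation, which handles the cut position and edge cases somewhat differently.

```shell
# Simplified sliding-window trimming on space-separated Phred scores.
# $1 = window size, $2 = required mean quality, $3 = scores
# Prints the number of bases kept from the start of the read.
# sliding_window_cut is a hypothetical helper, not Trimmomatic's own algorithm.
sliding_window_cut() {
    echo "$3" | awk -v w="$1" -v q="$2" '{
        keep = NF
        for (i = 1; i + w - 1 <= NF; i++) {
            sum = 0
            for (j = i; j < i + w; j++) sum += $j
            if (sum / w < q) { keep = i - 1; break }   # cut where the bad window starts
        }
        print keep
    }'
}

sliding_window_cut 4 15 "30 30 30 30 10 10 10 10"   # prints: 4
```

In this example the window covering bases 4 to 7 still averages exactly 15, so the cut only happens at the window starting at base 5, which averages 10.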

LEADING: Cut bases off the start of a read, if below a threshold quality
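Sequencers are less accurate on the first few bases of a read, so the usual setting removes bases from the start one at a time for as long as their quality is below 3.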

TRAILING: Cut bases off the end of a read, if below a threshold quality
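Likewise, low-quality bases can be cut from the end of the read, although this is usually unnecessary, especially for paired-end data.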
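
LEADING and TRAILING both work base by base: a base is removed while its quality is below the threshold, and trimming stops at the first base that passes. The sketch below shows that logic on a list of Phred scores. `trim_ends` is a hypothetical helper name used only for illustration.

```shell
# Apply LEADING/TRAILING-style trimming to space-separated Phred scores.
# $1 = LEADING threshold, $2 = TRAILING threshold, $3 = scores
# Prints the 1-based range of bases kept, or "dropped" if nothing survives.
# trim_ends is a hypothetical helper, not part of Trimmomatic.
trim_ends() {
    echo "$3" | awk -v lead="$1" -v trail="$2" '{
        first = 1
        while (first <= NF && $first < lead) first++    # strip low-quality start
        last = NF
        while (last >= first && $last < trail) last--   # strip low-quality end
        if (first > last) print "dropped"
        else print first "-" last
    }'
}

trim_ends 3 3 "2 2 30 30 30 2"   # prints: 3-5
```

Bases in the middle of the read are never touched by these two steps, whatever their quality. That is the job of SLIDINGWINDOW.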

CROP: Cut the read to a specified length
Truncates the read to the given length and discards everything after it. Use with caution.

HEADCROP: Cut the specified number of bases from the start of the read
Removes the given number of bases from the start of the read and discards them. Likewise, use with caution.
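
Unlike the quality-based steps, CROP and HEADCROP cut at fixed positions regardless of quality. Their effect on a sequence string can be mimicked with `cut`. `crop` and `headcrop` are hypothetical helper names, and the read is a made-up example. In a real FASTQ record Trimmomatic trims the quality string in the same way.

```shell
# Mimic CROP:<length> and HEADCROP:<length> on a sequence string.
# crop and headcrop are hypothetical helpers, not part of Trimmomatic.
crop() {
    printf '%s\n' "$2" | cut -c "1-$1"            # keep the first $1 bases
}
headcrop() {
    printf '%s\n' "$2" | cut -c "$(( $1 + 1 ))-"  # drop the first $1 bases
}

crop 10 ACGTACGTTTGGCC      # prints: ACGTACGTTT
headcrop 8 ACGTACGTTTGGCC   # prints: TTGGCC
```

Because every read loses the same bases whether or not they are bad, these steps throw away good data along with the bad, which is the reason for the caution.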

MINLEN: Drop the read if it is below a specified length
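How much this parameter matters depends on your data, so look at your FastQC report first. Reads are usually around 100 bp, in which case the commonly used value of 36 is fine. If your reads are 50 bp or even shorter, you need to adjust it. The lower you set it, the more very short reads survive, and those are the reads most likely to be wrong or to map ambiguously.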
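
One way to judge a MINLEN value before running Trimmomatic is to count how many reads already fall below it. The following is a minimal sketch for an uncompressed four-line FASTQ file. `read_length_summary` is a hypothetical helper name, and the demo reads are made up.

```shell
# Count reads and how many are shorter than a candidate MINLEN.
# $1 = uncompressed FASTQ file, $2 = candidate minimum length
# read_length_summary is a hypothetical helper, not part of Trimmomatic.
read_length_summary() {
    awk -v min="$2" 'NR % 4 == 2 {            # line 2 of each record is the sequence
        n++
        if (length($0) < min) below++
    } END { printf "%d reads, %d shorter than %d\n", n, below, min }' "$1"
}

# Demo: one 4 bp read and one 40 bp read.
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp"
printf '@r2\nACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' >> "$tmp"
read_length_summary "$tmp" 36   # prints: 2 reads, 1 shorter than 36
rm -f "$tmp"
```

Running it on the trimmed output with a few candidate values shows how many reads each choice would cost.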
