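SRA (Sequence Read Archive) is a database that stores raw next-generation sequencing data from platforms including 454, Illumina, SOLiD, Ion Torrent, Helicos, and Complete Genomics. Besides raw sequence data, SRA now also stores alignment information for raw reads mapped against a reference genome.
Based on how the data are produced, SRA records are divided into four types:
- Studies -- the research project
- Experiments -- the experimental design
- Runs -- the set of sequencing results
- Samples -- the sample information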
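The four record types can usually be told apart by the accession prefix. The sketch below is not from the original text: it assumes the standard prefixes SRP, SRX, SRR, and SRS for NCBI-issued accessions, with a leading E or D for accessions issued by ENA or DDBJ. The function name `sra_record_type` is made up for illustration.

```shell
#!/bin/sh
# Map an accession prefix to its SRA record type.
# Assumption: S/E/D as the first letter (NCBI, ENA, DDBJ), followed by
# RP (study), RX (experiment), RR (run), or RS (sample) and digits.
sra_record_type() {
    case "$1" in
        [SED]RP[0-9]*) echo "Study" ;;
        [SED]RX[0-9]*) echo "Experiment" ;;
        [SED]RR[0-9]*) echo "Run" ;;
        [SED]RS[0-9]*) echo "Sample" ;;
        *) echo "Unknown"; return 1 ;;
    esac
}

sra_record_type SRR11097713   # prints: Run
```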
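SRA Toolkit is a set of tools for downloading .sra files from the NCBI database and converting them to .fastq.gz files.
Go to the NCBI website and select the SRA database
Find the SRA Toolkit download page
Copy the download link
Download it on Linux with the wget command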
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.8/sratoolkit.2.10.8-ubuntu64.tar.gz
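If a different release or platform build is needed, the link can be assembled from its parts. This is a minimal sketch that assumes other releases follow the same path layout as the 2.10.8 URL above, which may not hold for every version. The helper name `sratoolkit_url` is hypothetical.

```shell
#!/bin/sh
# Build the SRA Toolkit tarball URL from a version and a platform tag.
# Assumption: the directory and file naming match the 2.10.8 link;
# check the download page if the resulting URL does not exist.
sratoolkit_url() {
    version="$1"
    platform="${2:-ubuntu64}"
    echo "https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${version}/sratoolkit.${version}-${platform}.tar.gz"
}

sratoolkit_url 2.10.8
# To download with the generated link:
#   wget "$(sratoolkit_url 2.10.8)"
```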
- Move the file to a directory of your choice, such as /home/sratoolkit
mkdir /home/sratoolkit
mv sratoolkit.2.10.8-ubuntu64.tar.gz /home/sratoolkit
Extract the archive
cd /home/sratoolkit
tar xzvf sratoolkit.2.10.8-ubuntu64.tar.gz
Edit the .bashrc file to add the toolkit's bin directory to PATH
echo "export PATH=\$PATH:/home/sratoolkit/sratoolkit.2.10.8-ubuntu64/bin" >> ~/.bashrc
source ~/.bashrc
fastq-dump -h
If fastq-dump -h reports an error at this point in the installation, run vdb-config --interactive as the error message instructs.
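Before moving on, it can help to confirm that the shell actually finds the toolkit binaries. The following is a small sketch using the standard `command -v` lookup. The function name `missing_tools` is made up here, and the three tool names are the ones used in this post.

```shell
#!/bin/sh
# Report which of the given commands cannot be found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

missing_tools prefetch fastq-dump vdb-config || echo "check the PATH line in ~/.bashrc"
```

If anything is reported missing, opening a new terminal or re-running source ~/.bashrc is a reasonable first thing to try.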
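Using SRA Toolkit
- Searching SRA: taking BRCA as an example, a search of the NCBI SRA database returns a large number of sequencing datasets. Papers also usually list the SRA accession numbers of their sequencing data, and these can be searched for directly.
Use the prefetch command to download a file, for example: prefetch SRR11097713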
prefetch Usage:
prefetch [options] <path/SRA file | path/kart file> [<path/file> ...]
prefetch [options] <SRA accession>
prefetch [options] --list <kart_file>
Frequently Used Options:
General:
-h | --help Displays ALL options, general usage, and version information.
-V | --version Display the version of the program.
Data transfer:
-f | --force <value> Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).
--transport <value> Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.
-l | --list List the contents of a kart file.
-s | --list-sizes List the content of kart file with target file sizes.
-N | --min-size <size> Minimum file size to download in KB (inclusive).
-X | --max-size <size> Maximum file size to download in KB (exclusive). Default: 20G.
-o | --order <value> Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). default: size.
-a | --ascp-path <ascp-binary|private-key-file> Path to ascp program and private key file (asperaweb_id_dsa.openssh).
-p | --progress <value> Time period in minutes to display download progress (0: no progress). Default: 1.
--option-file <file> Read more options and parameters from the file.
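When a paper lists several runs, typing one prefetch command per accession gets tedious. The sketch below prints one prefetch command per line of a plain-text list so the commands can be reviewed before they are run. The file name accessions.txt and the function name `prefetch_commands` are hypothetical. It only uses the `prefetch <SRA accession>` form shown in the usage above.

```shell
#!/bin/sh
# Print one prefetch command per accession listed in a text file.
# Blank lines and lines starting with '#' are skipped.
prefetch_commands() {
    # The "|| [ -n ... ]" part keeps a last line that has no newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        echo "prefetch $acc"
    done < "$1"
}

# Dry run, to inspect the commands first:
#   prefetch_commands accessions.txt
# Run them once they look right:
#   prefetch_commands accessions.txt | sh
```

Printing the commands instead of running them directly is a deliberate choice here: the list file is only trusted input if you wrote it yourself, so it is worth reading the output before piping it to sh.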
fastq-dump
- Convert sra to fastq:
fastq-dump SRR11097713
- Convert sra to fasta (sequence lines wrapped at 50 characters):
fastq-dump --fasta 50 SRR11097713
- Split paired-end reads into separate files:
fastq-dump --split-files SRR11097713
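These options can be combined. As a sketch, the helper below prints a command that splits paired reads, compresses the output with gzip, and writes it to a chosen directory, using only the --split-files, --gzip, and --outdir options from fastq-dump's own help text. The function name `fastq_dump_command` and the default directory name fastq are assumptions for illustration.

```shell
#!/bin/sh
# Print a fastq-dump command that splits paired reads, gzips the
# output, and writes it to an output directory.
# "fastq" as the default directory is a hypothetical choice.
fastq_dump_command() {
    acc="$1"
    outdir="${2:-fastq}"
    echo "fastq-dump --split-files --gzip --outdir $outdir $acc"
}

fastq_dump_command SRR11097713
# prints: fastq-dump --split-files --gzip --outdir fastq SRR11097713
```

For a paired-end run, the resulting files would be expected to carry the read number as a suffix, for example SRR11097713_1.fastq.gz and SRR11097713_2.fastq.gz.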
fastq-dump Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
-h | --help Displays ALL options, general usage, and version information.
-V | --version Display the version of the program.
Data formatting:
--split-files Dump each read into separate file. Files will receive suffix corresponding to read number.
--split-spot Split spots into individual reads.
--fasta <[line width]> FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I | --readids Append read id after spot id as 'accession.spot.readid' on defline.
-F | --origfmt Defline contains only original sequence name.
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B | --dumpbase Formats sequence using base space (default for other than SOLiD).
-Q | --offset <integer> Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
-N | --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range.
-X | --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.
-M | --minReadLen <len> Filter by sequence length >= <len>
--skip-technical Dump only biological reads.
--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat.
--unaligned Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
-O | --outdir <path> Output directory, default is current working directory ('.').
-Z | --stdout Output to stdout, all split data become joined into single stream.
--gzip Compress output using gzip.
--bzip2 Compress output using bzip2.
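The --offset option above is easier to follow with a worked example. In a FASTQ file each quality character encodes a score as its ASCII code minus the offset, so with the default offset of 33 the character "!" (ASCII 33) means a score of 0. The sketch below does this arithmetic in plain shell. The function name `phred_score` is made up, and it handles one character at a time.

```shell
#!/bin/sh
# Convert one FASTQ quality character to its numeric score.
# The second argument is the ASCII offset; 33 is assumed by default,
# matching fastq-dump's documented default.
phred_score() {
    offset="${2:-33}"
    # printf with a leading quote yields the character's ASCII code.
    code=$(printf '%d' "'$1")
    echo $((code - offset))
}

phred_score 'I'   # prints: 40
phred_score '!'   # prints: 0
```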