sra-tools
create conda env
conda create -n <env name>
activate conda env
conda activate <env name>
conda list
install sra-tools with conda
conda install sra-tools
prefetch .sra file
prefetch SRR12345678
the downloaded .sra file is saved in the default path (not the current working dir):
~/ncbi/public/sra/SRR12345678.sra
set up a user-defined .sra storage path via a symlink (first back up the .sra files in the default path)
rm -rf ~/ncbi/*
rmdir ~/ncbi
mkdir /path/to/ncbi
ln -s /path/to/ncbi ~/ncbi
# link
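A minimal sketch of the same relocation that moves the existing files instead of deleting them. It runs entirely inside a scratch directory: `sandbox`, `default_dir` and `new_dir` are hypothetical stand-ins for the home dir, ~/ncbi and /path/to/ncbi, and the .sra file is a dummy.

```shell
set -eu

# Hypothetical sandbox so nothing under the real home dir is touched.
sandbox=$(mktemp -d)
default_dir="$sandbox/ncbi"        # plays the role of ~/ncbi
new_dir="$sandbox/bigdisk/ncbi"    # plays the role of /path/to/ncbi

# Pretend an earlier prefetch left a file in the default location.
mkdir -p "$default_dir/public/sra"
echo "dummy" > "$default_dir/public/sra/SRR12345678.sra"

# 1. Move the existing content to the new location (acts as the backup).
mkdir -p "$(dirname "$new_dir")"
mv "$default_dir" "$new_dir"

# 2. Link the default path to the new location.
ln -s "$new_dir" "$default_dir"

# 3. The old path still resolves, but the data now lives in new_dir.
ls -l "$default_dir"
ls "$default_dir/public/sra"
```

The link can only be created once the default path no longer exists as a real directory, which is why the move (or the rm/rmdir pair) comes before ln -s.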
prefetch SRR12345678
srapath SRR12345678
# return the current storage path
convert .sra (paired-end) to _1.fastq.gz and _2.fastq.gz (remove --split-3 to get a single .fastq.gz from a single-end .sra file)
fastq-dump --gzip --defline-qual '+' --split-3 SRR12345678.sra
other options available for fastq-dump
fastq-dump --help
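A quick sanity check after a paired-end dump is that both mate files hold the same number of reads. The sketch below assumes unwrapped FASTQ (4 lines per read) and uses two made-up toy files; `count_reads` is a hypothetical helper, not part of sra-tools.

```shell
set -eu

tmp=$(mktemp -d)

# Two toy mate files with two made-up reads each.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip > "$tmp/toy_1.fastq.gz"
printf '@r1/2\nACGA\n+\nIIII\n@r2/2\nCTGA\n+\nIIII\n' | gzip > "$tmp/toy_2.fastq.gz"

# Hypothetical helper: reads = lines / 4 (assumes 4 lines per record).
count_reads() {
  cr_lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
  echo $(( cr_lines / 4 ))
}

n1=$(count_reads "$tmp/toy_1.fastq.gz")
n2=$(count_reads "$tmp/toy_2.fastq.gz")
echo "mate 1: $n1 reads, mate 2: $n2 reads"

if [ "$n1" -ne "$n2" ]; then
  echo "mate files differ in read count" >&2
fi
```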
Bowtie2
uses the Burrows-Wheeler Transform (with $ as the end-of-string marker) for searching (BWA uses it as well)
wget the corresponding fasta (.fna.gz) file from ncbi genome database
wget <url>
zgrep '^>' ref.fna.gz
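The zgrep line above lists the FASTA header lines, i.e. the sequence names that later show up in the .sam/.bam header. A small sketch on a made-up two-record reference, written with gzip -dc and grep so it does not depend on zgrep being installed:

```shell
set -eu

tmp=$(mktemp -d)

# Toy reference with two made-up records.
printf '>seqA toy chromosome\nACGTACGT\n>seqB toy plasmid\nGGCC\n' | gzip > "$tmp/ref.fna.gz"

# Same filter as zgrep '^>' ref.fna.gz: keep only the header lines.
headers=$(gzip -dc "$tmp/ref.fna.gz" | grep '^>')
n_seqs=$(printf '%s\n' "$headers" | wc -l | tr -d ' ')

echo "$headers"
echo "sequences: $n_seqs"
```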
install Bowtie2 in conda env
conda install bowtie2
run fastqc on _1.fastq.gz and _2.fastq.gz; outputs a _fastqc.html and a _fastqc.zip for each input
fastqc *.fastq.gz
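A small sketch for checking that every input got a report, assuming FastQC names the report by replacing .fastq.gz with _fastqc.html. The files here are empty toy stand-ins in a scratch dir.

```shell
set -eu

tmp=$(mktemp -d)

# Empty toy stand-ins: two inputs, but only mate 1 has a report.
: > "$tmp/toy_1.fastq.gz"
: > "$tmp/toy_2.fastq.gz"
: > "$tmp/toy_1_fastqc.html"

missing=""
for fq in "$tmp"/*.fastq.gz; do
  # Expected report name: strip .fastq.gz, append _fastqc.html.
  report="${fq%.fastq.gz}_fastqc.html"
  if [ ! -f "$report" ]; then
    missing="${missing:+$missing }$(basename "$fq")"
  fi
done

echo "inputs without a report: ${missing:-none}"
```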
run multiqc in the current working dir; outputs a multiqc_data dir and multiqc_report.html
multiqc .
decompress ref.fna.gz to ref.fna for Bowtie2 to read
gunzip ref.fna.gz
build the ref index for Bowtie2 alignment; outputs several prefix.*.bt2 and prefix.rev.*.bt2 files
bowtie2-build ref.fna <output prefix>
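Before aligning, it can help to confirm that the index is complete. The sketch below assumes a small index made of six files (.1.bt2 to .4.bt2, .rev.1.bt2, .rev.2.bt2); large genomes get .bt2l files instead, which this check does not cover. `index_ok` is a hypothetical helper and the index files are empty fakes.

```shell
set -eu

tmp=$(mktemp -d)
prefix="$tmp/ref.prefix"   # hypothetical index prefix

# Fake a complete small index with empty files.
for ext in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
  : > "$prefix.$ext"
done

# Hypothetical helper: succeeds only if all six index files exist.
index_ok() {
  for io_ext in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
    [ -f "$1.$io_ext" ] || return 1
  done
  return 0
}

if index_ok "$prefix"; then
  echo "index complete"
else
  echo "index incomplete"
fi
```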
bowtie2 alignment to .sam
bowtie2 -x ref.prefix -1 _1.fastq.gz -2 _2.fastq.gz -S .sam
(this process mainly consumes CPU, not much memory)
other options for the bowtie2 command:
time bowtie2 -p <cores> -x ref.prefix -1 _1.fastq.gz -2 _2.fastq.gz -S .sam
(time reports the running time of the program; -p specifies the number of cores)
look at system running status
top
look at the tail of file dynamically while it's being produced
tail -F .sam
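One way to pick the value for -p is to read the core count from the system. This is only a sketch: leaving one core free is an arbitrary choice, the sample_* file names are hypothetical, and the command is printed rather than run.

```shell
set -eu

# Number of online cores; fall back to nproc, then to 1.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || echo 1)

# Leave one core free for the rest of the system (arbitrary choice).
if [ "$cores" -gt 1 ]; then
  threads=$(( cores - 1 ))
else
  threads=1
fi

# Hypothetical file names; print the command instead of running it.
cmd="bowtie2 -p $threads -x ref.prefix -1 sample_1.fastq.gz -2 sample_2.fastq.gz -S sample.sam"
echo "$cmd"
```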
Samtools
- .fastq is txt
- .sam is txt with mapping info (large)
- .bam is compressed binary from .sam
- .sort.bam is sorted by mapped chr location instead of by sequencing read order as in .fastq and .sam; this compresses better, so it is smaller than .bam
- .sort.bam.bai is an index for faster searching of reads at a given chr location in .sort.bam
install samtools in conda env
conda install samtools
convert .sam to .bam
samtools view -b -o .bam .sam
.sam is large, for E. coli ~1.5 GB
remove the .sam file
(.bam is ~400 MB)
rm .sam
print .bam as .sam text if needed (add -h to include the header, -o <file> to write to a file)
samtools view .bam
sort .bam according to chr mapping location
(.sam keeps the same read order as the .fastq)
samtools sort -o .sort.bam .bam
index .sort.bam for faster searching of reads by chr location; outputs a .sort.bam.bai file
samtools index .sort.bam
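The view, sort and index steps above can be chained in one script. The sketch below is a dry run by default, so it only prints the commands; `run`, `DRY_RUN` and the `sample` prefix are hypothetical names. Setting DRY_RUN=0 would execute the real samtools commands.

```shell
set -eu

sample="sample"            # hypothetical file prefix
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands

# Hypothetical wrapper: print the command, or run it when DRY_RUN=0.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run samtools view -b -o "$sample.bam" "$sample.sam"
run samtools sort -o "$sample.sort.bam" "$sample.bam"
run samtools index "$sample.sort.bam"
```

With set -e the script stops at the first failing step, so a later cleanup such as rm .sam would only be reached after the .bam was written.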
look at the header of .sort.bam
samtools view -H .sort.bam
search for reads mapped to NC_000913.3:10000-20000
samtools view .sort.bam NC_000913.3:10000-20000
samtools view .sort.bam NC_000913.3:10000-20000 | wc -l
search for flag explanations
samtools flags 17
samtools flags PAIRED
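The flag is a bit field, so a number like 17 is the sum of individual flag bits (1 + 16). A minimal sketch that decodes a flag with shell arithmetic, assuming the 12 standard SAM flag names in bit order; `decode_flag` is a hypothetical helper, not a samtools command.

```shell
set -eu

# Hypothetical helper: list the flag names whose bits are set.
decode_flag() {
  df_flag=$1
  df_out=""
  df_bit=1
  # Names in bit order: 0x1, 0x2, 0x4, ... 0x800.
  for df_name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                 READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
    if [ $(( df_flag & df_bit )) -ne 0 ]; then
      df_out="${df_out:+$df_out,}$df_name"
    fi
    df_bit=$(( df_bit * 2 ))
  done
  echo "$df_out"
}

decode_flag 17   # prints PAIRED,REVERSE
decode_flag 99   # prints PAIRED,PROPER_PAIR,MREVERSE,READ1
```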
summarize flags in .sort.bam
samtools flagstat .sort.bam
search for commands in history, fyi
history | grep samtools