Shell exercises for SAM and BAM files

Preparation

mkdir -p ~/biosoft
cd ~/biosoft
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip 
unzip bowtie2-2.3.4.3-linux-x86_64.zip 
cd ~/biosoft/bowtie2-2.3.4.3-linux-x86_64/example/reads
../../bowtie2 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq > tmp.sam
# samtools view -bS tmp.sam >tmp.bam

Understanding the command

1. The command

../../bowtie2 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq > tmp.sam   # alignment command

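2. What the command does (following the 生信技能树 tutorial):
Using bowtie2 takes two steps: build an index first, then align.
Building the index:
Basic command: bowtie2-build your-reference.fa your-index-name
The example directory already ships a prebuilt index for the lambda virus reference; its location and prefix are ../index/lambda_virus
Basic alignment syntax:
Single-end: bowtie2 -x tmp -U reads.fq -S hahahhha.sam   (-U marks unpaired reads)
Paired-end: bowtie2 -x tmp -1 reads1.fq -2 reads2.fq -S hahahha.sam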

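../../bowtie2: bowtie2 has not been added to PATH yet, so it has to be called with an explicit path
../index/lambda_virus: the prefix of the prebuilt index files
-S: followed by the output file name; the command above leaves it out and redirects standard output to tmp.sam with > instead
Summary: the paired reads in reads_1.fq and reads_2.fq are aligned against the index ../index/lambda_virus, and the alignments are written to tmp.sam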
1 Count how many reads took part in the alignment against the reference genome (a paired-end pair counts as one read here)

grep -v "^@" tmp.sam | cut -f1 | sort -u | wc -l
10000

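2 Count how many alignment types there are (the distinct values in column 2, the FLAG) and how they are distributed.

grep -v "^@" tmp.sam | cut -f2 | sort -n | uniq -c | less -S
     24 65
    165 69
    153 73
    213 77
      2 81
   4650 83
    136 89
   4516 99
    125 101
     16 113
     24 129
    153 133
    165 137
    213 141
   4516 147
    125 153
      2 161
   4650 163
    136 165
     16 177
# To see what each number means: search for "picard sam flag", open the first result, and type a value such as 163 into the SAM Flag box; the properties that are set are listed on the right. Each FLAG is a sum of bit values, for example 163 = 1 + 2 + 32 + 128 (paired, properly paired, mate on the reverse strand, second read in the pair).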
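
The bit arithmetic behind these FLAG values can be sketched in a few lines of shell. This is only a sketch and not part of the original exercise: decode_flag is a hypothetical helper name, and the labels are shorthand for the bit meanings listed in the SAM specification (1 = paired, 2 = proper pair, 4 = unmapped, and so on).

```shell
# Decode a SAM FLAG into the names of the bits that are set.
# decode_flag is a hypothetical helper; labels abbreviate the SAM spec wording.
decode_flag() {
    flag=$1
    out=""
    bit=1
    for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
                first_in_pair second_in_pair secondary qc_fail duplicate supplementary; do
        # A bitwise AND tells whether this bit is part of the flag.
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
}

decode_flag 163
decode_flag 77
```

Running it on the values from the table above shows, for instance, that 77 means both this read and its mate are unmapped.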

3 Pick out the reads that failed to align and look at their sequences.

grep -v "^@" tmp.sam | cut -f6 | grep -v "M" | wc -l
1005
grep -v "^@" tmp.sam | awk '$6=="*"' | wc -l
1005
grep -v "^@" tmp.sam | cut -f6,10 | grep -v "M" | less -SN
grep -v "^@" tmp.sam | awk '$6=="*" {print $10}'   # inspect the sequences: many of the unaligned reads contain a lot of N

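4 Split the reads that failed to align into single-end and both-end failures, and get the read IDs

grep -v "^@" tmp.sam | awk '$6=="*"' | cut -f1 | sort | uniq -c | awk '$1==2 {print $2}'   # IDs of pairs where both ends failed to align
grep -v "^@" tmp.sam | awk '$6=="*"' | cut -f1 | sort | uniq -c | awk '$1==1 {print $2}'   # IDs of pairs where only one end failed to align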
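
The same split can also be read straight from the FLAG column instead of counting IDs. The sketch below is an alternative, not the method used above: it assumes bit 4 marks an unmapped read and bit 8 an unmapped mate (as in the SAM specification), and it runs on a made-up six-line sample (reads r1 to r3 are hypothetical) rather than on tmp.sam.

```shell
# Build a tiny tab-separated SAM-like sample (hypothetical reads r1..r3).
sam=$(mktemp)
printf 'r1\t77\t*\t0\t0\t*\t*\t0\t0\tNNNN\tIIII\n'           >  "$sam"
printf 'r1\t141\t*\t0\t0\t*\t*\t0\t0\tNNNN\tIIII\n'          >> "$sam"
printf 'r2\t73\tchr\t10\t42\t4M\t=\t10\t0\tACGT\tIIII\n'     >> "$sam"
printf 'r2\t133\tchr\t10\t0\t*\t=\t10\t0\tNNNN\tIIII\n'      >> "$sam"
printf 'r3\t99\tchr\t20\t42\t4M\t=\t60\t44\tACGT\tIIII\n'    >> "$sam"
printf 'r3\t147\tchr\t60\t42\t4M\t=\t20\t-44\tACGT\tIIII\n'  >> "$sam"

# Integer division and modulo stand in for bitwise AND, which plain awk lacks.
unmapped_ids=$(awk -F'\t' '
    /^@/ { next }
    int($2 / 4) % 2 == 1 {                               # bit 4: this read is unmapped
        if (int($2 / 8) % 2 == 1) print "both", $1       # bit 8: the mate is unmapped too
        else                      print "single", $1
    }' "$sam" | sort -u)

echo "$unmapped_ids"
rm -f "$sam"
```

On the sample this reports r1 as a both-end failure and r2 as a single-end failure; replacing "$sam" with tmp.sam would apply the same test to the real data.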

5 Select the alignments with a mapping quality greater than 30 (column 5)

grep -v "^@" tmp.sam | awk '$5>30' | wc -l
18632

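In a SAM file, column 1 is the read ID (QNAME), column 2 the alignment type (FLAG), column 3 the reference sequence name (RNAME), column 4 the 1-based leftmost position (POS), column 5 the mapping quality (MAPQ), column 6 the alignment description (CIGAR), and column 10 the read sequence (SEQ).

6 Select the reads that aligned, but not as a plain match

grep -v "^@" tmp.sam | cut -f6 | grep "[IDNSPH=X]"
# Note: M in a CIGAR string covers both matches and mismatches, so an all-M CIGAR can still hide mismatches; the NM:i tag gives the edit distance.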

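7 Select the paired-end reads whose insert size is larger than 1250 bp

grep -v "^@" tmp.sam | awk '$9>1250 || $9<-1250' | head | less -SN
# Column 7 (RNEXT) only holds "=" when the mate maps to the same reference. The insert size is in column 9 (TLEN), which is negative for the rightmost read of a pair, hence the test on both signs.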
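
As a small sketch of the insert-size filter, the snippet below takes the absolute value of column 9 and prints each qualifying pair once. The sample records (pairs p1 and p2) are made up for illustration, and the 1250 threshold is the one from the exercise.

```shell
# Two hypothetical pairs: p1 spans 1404 bp, p2 spans 204 bp.
sam=$(mktemp)
printf 'p1\t99\tchr\t100\t42\t4M\t=\t1500\t1404\tACGT\tIIII\n'    >  "$sam"
printf 'p1\t147\tchr\t1500\t42\t4M\t=\t100\t-1404\tACGT\tIIII\n'  >> "$sam"
printf 'p2\t99\tchr\t200\t42\t4M\t=\t400\t204\tACGT\tIIII\n'      >> "$sam"
printf 'p2\t147\tchr\t400\t42\t4M\t=\t200\t-204\tACGT\tIIII\n'    >> "$sam"

long_pairs=$(awk -F'\t' '
    /^@/ { next }
    {
        tlen = ($9 < 0) ? -$9 : $9          # TLEN is signed, so take the absolute value
        if (tlen > 1250) print $1, tlen
    }' "$sam" | sort -u)                    # both mates print the same line; keep one

echo "$long_pairs"
rm -f "$sam"
```

Only p1 passes the filter, and sort -u collapses its two mates into a single line.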

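8 Count the successfully aligned reads on each chromosome of the reference genome

grep -v "^@" tmp.sam | cut -f3 | sort | uniq -c
    426 *
  19574 gi|9626243|ref|NC_001416.1|
# Everything lands on the single reference sequence. Note that 19574 also includes unmapped reads whose mate aligned (they are given the mate's reference name), which is why only 426 of the 1005 unmapped reads show "*" here.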
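
To count only reads that really aligned, the unmapped bit of the FLAG can be checked before tallying column 3. The following is a sketch on a made-up sample (the reference names chrA and chrB and the reads are hypothetical), assuming bit 4 of the FLAG means "unmapped" as in the SAM specification.

```shell
# Hypothetical sample: r1 fully unmapped, r2 half mapped on chrA, r3 fully mapped on chrB.
sam=$(mktemp)
printf 'r1\t77\t*\t0\t0\t*\t*\t0\t0\tNNNN\tIIII\n'            >  "$sam"
printf 'r1\t141\t*\t0\t0\t*\t*\t0\t0\tNNNN\tIIII\n'           >> "$sam"
printf 'r2\t73\tchrA\t10\t42\t4M\t=\t10\t0\tACGT\tIIII\n'     >> "$sam"
printf 'r2\t133\tchrA\t10\t0\t*\t=\t10\t0\tNNNN\tIIII\n'      >> "$sam"
printf 'r3\t99\tchrB\t20\t42\t4M\t=\t60\t44\tACGT\tIIII\n'    >> "$sam"
printf 'r3\t147\tchrB\t60\t42\t4M\t=\t20\t-44\tACGT\tIIII\n'  >> "$sam"

# Skip headers and reads with the unmapped bit (4) set, then tally by reference name.
mapped_counts=$(awk -F'\t' '
    /^@/ { next }
    int($2 / 4) % 2 == 0 { n[$3]++ }
    END { for (ref in n) print n[ref], ref }' "$sam" | sort -k2,2)

echo "$mapped_counts"
rm -f "$sam"
```

Here the unmapped mate of r2 carries chrA in column 3 but is not counted, so chrA gets 1 and chrB gets 2.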

9 Select the alignments whose original fq sequence contains N

grep -v "^@" tmp.sam | cut -f10 | grep N
grep -v "^@" tmp.sam | awk '$10 ~ /N/' | less -SN

10 Select the alignments whose original fq sequence contains N but that still align as a plain match

grep -v "^@" tmp.sam | awk '$10 ~ /N/ && $6 != "*" && $6 !~ /[IDNSPH=X]/' | less -SN

11 Number of header lines in the SAM file

grep -c "^@" tmp.sam
3

12 Does every line of the SAM file carry the same number of tags?

13 How many tags does each line of the SAM file have?
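
Questions 12 and 13 are left unanswered above. One possible approach, sketched here on a made-up sample rather than on tmp.sam, relies on the fact that a SAM record has 11 mandatory columns, so the number of optional tags on a line is the field count minus 11. The tag values in the sample are invented for illustration.

```shell
# Hypothetical records: a and c carry three tags, b carries one.
sam=$(mktemp)
printf 'a\t0\tchr\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tNM:i:0\tYT:Z:UU\n'  >  "$sam"
printf 'b\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tYT:Z:UP\n'                      >> "$sam"
printf 'c\t0\tchr\t9\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tNM:i:1\tYT:Z:UU\n'  >> "$sam"

# Print "number_of_tags number_of_lines", sorted by the number of tags.
tag_counts=$(awk -F'\t' '
    /^@/ { next }
    { c[NF - 11]++ }
    END { for (k in c) print k, c[k] }' "$sam" | sort -n)

echo "$tag_counts"
rm -f "$sam"
```

If the output has a single line, every record has the same number of tags; more than one line means the counts differ, as they do in this sample.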

14 What are the lengths of the reference chromosomes recorded in the SAM file?

$ grep "^@" tmp.sam 
@HD VN:1.0  SO:unsorted
@SQ SN:gi|9626243|ref|NC_001416.1|  LN:48502
@PG ID:bowtie2  PN:bowtie2  VN:2.3.4.3  CL:"/trainee1/vip77/biosoft/bowtie2-2.3.4.3-linux-x86_64/example/reads/../../bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"

SN:gi|9626243|ref|NC_001416.1| LN:48502: there is only one chromosome, and it is 48502 bp long

15 Find the alignments that contain an insertion

grep -v "^@" tmp.sam | awk '$6 ~ /I/' | less -SN

16 Find the alignments that contain a deletion

grep -v "^@" tmp.sam | awk '$6 ~ /D/' | less -SN

17 Extract the alignment records that start in a given region of the reference, for example 5013 to 50130

grep -v "^@" tmp.sam | awk '$4>=5013 && $4<=50130' | less -SN

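18 Sort the SAM file by chromosome and then by start coordinate

grep -v "^@" tmp.sam | sort -k3,3 -k4,4n | less -SN

All reads are on the same chromosome, so here the order is decided by the coordinate alone.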
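
With more than one reference sequence, the two sort keys start to matter, and the header lines should stay on top. The sketch below shows this on a made-up sample (chr1, chr2 and the reads x, y, z are hypothetical); on real data, samtools sort is the usual tool for this job.

```shell
# Hypothetical sample with two references and records in scrambled order.
sam=$(mktemp)
printf '@SQ\tSN:chr1\tLN:100\n'                             >  "$sam"
printf '@SQ\tSN:chr2\tLN:100\n'                             >> "$sam"
printf 'x\t0\tchr2\t5\t42\t4M\t*\t0\t0\tACGT\tIIII\n'       >> "$sam"
printf 'y\t0\tchr1\t30\t42\t4M\t*\t0\t0\tACGT\tIIII\n'      >> "$sam"
printf 'z\t0\tchr1\t4\t42\t4M\t*\t0\t0\tACGT\tIIII\n'       >> "$sam"

tab=$(printf '\t')
# Keep the header first, then sort records by column 3 (text) and column 4 (numeric).
sorted=$( { grep '^@' "$sam"
            grep -v '^@' "$sam" | LC_ALL=C sort -t "$tab" -k3,3 -k4,4n; } )

printf '%s\n' "$sorted"
rm -f "$sam"
```

The records come out as z, y, x: chr1 before chr2, and position 4 before position 30 because the second key is compared numerically.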

19 Find the alignments whose CIGAR is 102M3D11M and compute the read length.

grep -v "^@" tmp.sam | awk '$6=="102M3D11M" {print length($10)}'
113
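
The result 113 can also be derived from the CIGAR string itself: 102M and 11M consume read bases while 3D consumes only reference bases, so 102 + 11 = 113. As a rough sketch (cigar_read_length is a hypothetical helper name), the function below sums the operations that the SAM specification lists as consuming the read: M, I, S, = and X.

```shell
# Compute the read length implied by a CIGAR string.
# cigar_read_length is a hypothetical helper, not a samtools command.
cigar_read_length() {
    echo "$1" | awk '{
        total = 0
        cigar = $0
        # Peel off one "<number><operation>" unit at a time from the front.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            if (op ~ /[MIS=X]/) total += len    # operations that consume read bases
            cigar = substr(cigar, RLENGTH + 1)
        }
        print total
    }'
}

cigar_read_length 102M3D11M
```

Deletions (D), skipped regions (N), hard clips (H) and padding (P) are parsed but not added, which is why the 3D in this CIGAR does not change the total.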

That's all for today. Feeling pretty pleased with myself...

Shell exercises for SAM and BAM files 2019-05-04
