1. Create the folders:
vip41@VM-0-15-ubuntu:~/linux练习题$ mkdir -p 1/2/3/4/5/6/7/8/9
vip41@VM-0-15-ubuntu:~/linux练习题$ cd 1
vip41@VM-0-15-ubuntu:~/linux练习题/1$ tree
.
└── 2
    └── 3
        └── 4
            └── 5
                └── 6
                    └── 7
                        └── 8
                            └── 9
2, 3. Create a .txt file and write text into it. Two methods:
Method 1: create the empty file with touch, then fill it with cat:
vip41@VM-0-15-ubuntu:~/linux练习题/1$ cd 2/3/4/5/6/7/8/9/
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ touch me.txt
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ ls -lh
total 0
-rw-rw-r-- 1 vip41 vip41 0 Dec 16 22:58 me.txt
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ cat >me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
^C
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ cat me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
Method 2: skip touch, because the redirection in cat > creates the file by itself:
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ cat >me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
^C
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ cat me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ ls -lh
total 4.0K
-rw-rw-r-- 1 vip41 vip41 66 Dec 16 23:03 me.txt
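Both methods above type the text interactively (in an interactive cat, Ctrl-D is the conventional way to end input). A non-interactive alternative is a here-document, sketched below. The temporary directory is only an assumption so that the example does not touch real files; the byte count it prints should match the 66 bytes in the ls -lh output above.

```shell
# Work in a throwaway directory (an assumption for this sketch).
workdir=$(mktemp -d)

# The quoted 'EOF' delimiter keeps the text literal: no variable expansion.
cat > "$workdir/me.txt" <<'EOF'
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
EOF

cat "$workdir/me.txt"        # show the content
wc -c < "$workdir/me.txt"    # size in bytes
```

Because nothing is typed by hand, this form is also convenient inside scripts.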
4. Delete all the folders and files created above:
vip41@VM-0-15-ubuntu:~/linux练习题/1/2/3/4/5/6/7/8/9$ cd ~/linux练习题/
vip41@VM-0-15-ubuntu:~/linux练习题$ rm -r 1
vip41@VM-0-15-ubuntu:~/linux练习题$ ls -lh
total 4.9M
-rw-rw-r-- 1 vip41 vip41 2.6M Jan 11 2017 hg38.tss
drwxrwxr-x 4 vip41 vip41 4.0K Nov 12 2016 rmDuplicate
-rw-rw-r-- 1 vip41 vip41 103K Nov 12 2016 rmDuplicate.zip
drwxrwxr-x 3 vip41 vip41 4.0K Dec 14 23:32 sickle-results
-rw-rw-r-- 1 vip41 vip41 2.3M Oct 6 2016 sickle-results.zip
-rw-rw-r-- 1 vip41 vip41 3.1K May 18 2017 test.bed
# Folder 1 has been deleted
5. Create five folders, folder_1 to folder_5, then create folder_1 to folder_5 again inside each of them: mkdir -p folder_{1..5}/folder_{1..5}
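The one-liner relies on brace expansion, a bash feature: each {1..5} expands to 1 2 3 4 5, and the two groups combine into 25 nested paths. In shells without brace expansion (dash, for example), a nested loop does the same job. The sketch below runs in a temporary directory, which is an assumption made only to keep it self-contained.

```shell
# Throwaway working directory (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Portable equivalent of: mkdir -p folder_{1..5}/folder_{1..5}
for i in 1 2 3 4 5; do
    for j in 1 2 3 4 5; do
        mkdir -p "folder_$i/folder_$j"
    done
done

# 5 top-level folders plus 25 nested ones.
find folder_* -type d | wc -l
```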
6. In every folder created in exercise 5, create the me.txt text file from exercise 2, with the same content:
xargs is usually used together with a pipe:
somecommand | xargs -I{} command {}
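One possible way to put this together is to list the folders with find and let xargs copy the file into each one. This is a sketch rather than the only answer; the temporary directory and the locally recreated me.txt are assumptions that keep the example self-contained.

```shell
# Throwaway working directory (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate me.txt from exercise 2 and the folder layout from exercise 5.
printf '%s\n' 'Go to: http://www.biotrainee.com/' 'I love bioinfomatics.' 'And you ?' > me.txt
for i in 1 2 3 4 5; do
    for j in 1 2 3 4 5; do
        mkdir -p "folder_$i/folder_$j"
    done
done

# find prints one folder per line; xargs -I{} runs cp once per line,
# substituting the folder name for {}.
find folder_* -type d | xargs -I{} cp me.txt {}/

# One copy per folder: 5 top-level + 25 nested.
find folder_* -name me.txt | wc -l
```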
7. Delete the folders and files created above:
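No answer was recorded for this exercise. A minimal sketch, assuming the folders are named folder_1 to folder_5 as in exercise 5: preview what the glob matches, then remove it recursively. The temporary directory and the keep.txt file are hypothetical and exist only to show that nothing else is touched.

```shell
# Throwaway working directory (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p folder_1/folder_1 folder_2/folder_1
touch folder_1/folder_1/me.txt keep.txt

ls -d folder_*    # preview what the glob matches before deleting
rm -r folder_*    # remove the folders and everything inside them
ls                # list what is left
```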
8. Download http://www.biotrainee.com/jmzeng/igv/test.bed, then find which line contains H3K4me3 and how many lines the file has in total.
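No answer was recorded for this exercise either. After downloading with wget, grep -n reports the line number of each match and wc -l counts the lines. The sketch below uses a small made-up BED file instead of the real download, so the numbers it prints are not the answer for test.bed.

```shell
# Made-up three-line BED file standing in for test.bed (assumption).
bed=$(mktemp)
printf 'chr1\t100\t200\tH3K27ac\nchr1\t300\t400\tH3K4me3\nchr2\t500\t600\tCTCF\n' > "$bed"

# -n prefixes each matching line with its line number.
grep -n 'H3K4me3' "$bed"

# Total number of lines in the file.
wc -l < "$bed"
```

On the real file the same two commands would be grep -n 'H3K4me3' test.bed and wc -l test.bed.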
9. Download http://www.biotrainee.com/jmzeng/rmDuplicate.zip, unzip it, and look at the folder structure inside:
vip41@VM-0-15-ubuntu:~/biosoft/data1$ wget http://www.biotrainee.com/jmzeng/rmDuplicate.zip
--2018-12-13 22:09:18-- http://www.biotrainee.com/jmzeng/rmDuplicate.zip
Resolving www.biotrainee.com (www.biotrainee.com)... 123.206.72.184
Connecting to www.biotrainee.com (www.biotrainee.com)|123.206.72.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104931 (102K) [application/zip]
Saving to: ‘rmDuplicate.zip’
rmDuplicate.zip 100%[===========>] 102.47K 551KB/s in 0.2s
2018-12-13 22:09:19 (551 KB/s) - ‘rmDuplicate.zip’ saved [104931/104931]
# Unzip the file
vip41@VM-0-15-ubuntu:~/biosoft/data1$ unzip rmDuplicate.zip
Archive: rmDuplicate.zip
creating: rmDuplicate/
creating: rmDuplicate/picard/
creating: rmDuplicate/picard/paired/
inflating: rmDuplicate/picard/paired/readme.txt
inflating: rmDuplicate/picard/paired/tmp.header
inflating: rmDuplicate/picard/paired/tmp.MarkDuplicates.log
inflating: rmDuplicate/picard/paired/tmp.metrics
inflating: rmDuplicate/picard/paired/tmp.rmdup.bai
inflating: rmDuplicate/picard/paired/tmp.rmdup.bam
inflating: rmDuplicate/picard/paired/tmp.sam
inflating: rmDuplicate/picard/paired/tmp.sorted.bam
creating: rmDuplicate/picard/single/
inflating: rmDuplicate/picard/single/.MarkDuplicates.log
inflating: rmDuplicate/picard/single/readme.txt
inflating: rmDuplicate/picard/single/tmp.header
inflating: rmDuplicate/picard/single/tmp.MarkDuplicates.log
inflating: rmDuplicate/picard/single/tmp.metrics
inflating: rmDuplicate/picard/single/tmp.rmdup.bai
inflating: rmDuplicate/picard/single/tmp.rmdup.bam
inflating: rmDuplicate/picard/single/tmp.sam
inflating: rmDuplicate/picard/single/tmp.sorted.bam
creating: rmDuplicate/samtools/
creating: rmDuplicate/samtools/paired/
inflating: rmDuplicate/samtools/paired/readme.txt
inflating: rmDuplicate/samtools/paired/tmp.header
inflating: rmDuplicate/samtools/paired/tmp.rmdup.bam
inflating: rmDuplicate/samtools/paired/tmp.rmdup.vcf.gz
inflating: rmDuplicate/samtools/paired/tmp.sam
inflating: rmDuplicate/samtools/paired/tmp.sorted.bam
inflating: rmDuplicate/samtools/paired/tmp.sorted.vcf.gz
creating: rmDuplicate/samtools/single/
inflating: rmDuplicate/samtools/single/readme.txt
inflating: rmDuplicate/samtools/single/tmp.header
inflating: rmDuplicate/samtools/single/tmp.rmdup.bam
inflating: rmDuplicate/samtools/single/tmp.rmdup.vcf.gz
inflating: rmDuplicate/samtools/single/tmp.sam
inflating: rmDuplicate/samtools/single/tmp.sorted.bam
inflating: rmDuplicate/samtools/single/tmp.sorted.vcf.gz
# View the folder structure
vip41@VM-0-15-ubuntu:~/biosoft/data1/rmDuplicate$ tree
.
├── picard
│   ├── paired
│   │   ├── readme.txt
│   │   ├── tmp.header
│   │   ├── tmp.MarkDuplicates.log
│   │   ├── tmp.metrics
│   │   ├── tmp.rmdup.bai
│   │   ├── tmp.rmdup.bam
│   │   ├── tmp.sam
│   │   └── tmp.sorted.bam
│   └── single
│       ├── readme.txt
│       ├── tmp.header
│       ├── tmp.MarkDuplicates.log
│       ├── tmp.metrics
│       ├── tmp.rmdup.bai
│       ├── tmp.rmdup.bam
│       ├── tmp.sam
│       └── tmp.sorted.bam
└── samtools
    ├── paired
    │   ├── readme.txt
    │   ├── tmp.header
    │   ├── tmp.rmdup.bam
    │   ├── tmp.rmdup.vcf.gz
    │   ├── tmp.sam
    │   ├── tmp.sorted.bam
    │   └── tmp.sorted.vcf.gz
    └── single
        ├── readme.txt
        ├── tmp.header
        ├── tmp.rmdup.bam
        ├── tmp.rmdup.vcf.gz
        ├── tmp.sam
        ├── tmp.sorted.bam
        └── tmp.sorted.vcf.gz
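10. Open the files unzipped in exercise 9, go into the rmDuplicate/samtools/single folder, look at the file with the .sam suffix, and work out how SAM/BAM are defined in bioinformatics.
- SAM (Sequence Alignment/Map) is a general-purpose alignment format that stores how reads align to a reference sequence. It is a sequence alignment format standard defined by Sanger, written as TAB-delimited text. It is mainly used to represent the result of mapping sequencing reads to a genome, though it can also represent arbitrary multiple alignments. A SAM file has two parts: the header section (annotation information) and the alignment section (alignment results).
- BAM is the binary form of SAM, so the two carry the same information; BAM files simply take less storage space and are faster to process.
For details, see: https://www.jianshu.com/p/9c99e09630da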
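The two sections are easy to tell apart on the command line, because every header line starts with @. The sketch below builds a tiny made-up SAM file (the reads and the reference length are hypothetical) and separates the sections with grep; on the real data, less -S tmp.sam is a convenient way to page through the long lines.

```shell
# Tiny made-up SAM file (assumption): 2 header lines, 2 alignment lines.
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:coordinate\n' > "$sam"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$sam"
printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n' >> "$sam"
printf 'read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tTTGA\tFFFF\n' >> "$sam"

grep -c '^@' "$sam"     # number of header lines
grep -vc '^@' "$sam"    # number of alignment lines

# Columns 1-4 and 6 of each alignment: QNAME, FLAG, RNAME, POS, CIGAR.
grep -v '^@' "$sam" | awk -F'\t' '{print $1, $2, $3, $4, $6}'
```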
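11. Install samtools:
Method 1:
1. First install miniconda: download the miniconda installer with wget, run the .sh file with bash, keep pressing Enter until prompted for y/n, answer y, and continue until the installation succeeds. Then run source ~/.bashrc to activate conda and verify that it works.
2. First search whether samtools is available to install:
The channel is missing, so it cannot be installed; the mirror source needs to be changed.
Add some biology-related mirror sources:
3. Install samtools with conda:
samtools works:
Method 2:
Download samtools with wget. The steps are: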
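The conda commands themselves were not recorded in these notes. The sketch below shows what the search, channel, and install steps typically look like; the channel names (defaults, bioconda, conda-forge) are an assumption, since the notes do not say which mirror sources were added. The run helper is hypothetical: it only prints each command unless DRY_RUN=0 is set, so the sketch can be read and executed without changing the system.

```shell
# Hypothetical dry-run helper: prints the command; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 2: check whether samtools can be found in the configured channels.
run conda search samtools

# Add channels that carry bioinformatics packages (channel names are an assumption).
run conda config --add channels defaults
run conda config --add channels bioconda
run conda config --add channels conda-forge

# Step 3: install samtools and check that it runs.
run conda install -y samtools
run samtools --version
```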
echo 'export PATH=/home/vip41/biosoft/samtools/bin:$PATH' >>~/.bashrc
source ~/.bashrc
cd ~/biosoft/samtools
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -xjvf samtools-1.9.tar.bz2
cd samtools-1.9
./configure --prefix=/home/vip41/biosoft/samtools
make && make install
Note: on Linux, software is installed into a path you specify, and it must be a path you have permission to write to.
Verify that samtools works:
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
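samtools works, but basic commands such as ls and pwd no longer do. Running export PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin fixes ls and the other basic commands, but then samtools stops working. Adding the samtools path to the environment variable again brings samtools back, but ls and the other basic commands fail once more: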
vip41@VM-0-15-ubuntu:~/biosoft/samtools$ echo 'export PATH=/home/vip41/biosoft/samtools/bin:$PATH'>>~/.bashrc
vip41@VM-0-15-ubuntu:~/biosoft/samtools$ source ~/.bashrc
vip41@VM-0-15-ubuntu:~/biosoft/samtools$ ls
Command 'ls' is available in '/bin/ls'
The command could not be located because '/bin' is not included in the PATH environment variable.
ls: command not found
vip41@VM-0-15-ubuntu:~/biosoft/samtools$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
Fix: open the file with vi ~/.bashrc
and delete the samtools path entry; this solves the problem. For details, see: https://www.jianshu.com/p/57abd0804df6
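The error message above shows that /bin was missing from PATH, which means some line assigned PATH without keeping the system directories. The offending line in ~/.bashrc is not shown in these notes, so the sketch below only illustrates how a PATH entry can be appended and checked safely. It writes to a temporary file that stands in for ~/.bashrc (an assumption, so the real file is untouched).

```shell
rc=$(mktemp)    # stand-in for ~/.bashrc (assumption)

# Single quotes write the literal text $PATH into the file, so the
# existing PATH is kept and extended each time the file is sourced.
echo 'export PATH=/home/vip41/biosoft/samtools/bin:$PATH' >> "$rc"
cat "$rc"

. "$rc"                                   # same effect as: source "$rc"
echo "$PATH" | tr ':' '\n' | head -n 3    # first PATH entries, one per line
command -v ls                             # ls is still found
```

Listing the entries one per line makes it easy to spot a missing /bin or /usr/bin before opening a new terminal.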
12. Open a file with the .bam suffix and find the command that produced it.
- First use the find command to locate the .bam files:
vip41@VM-0-15-ubuntu:~$ find ~ -name '*.bam'
/home/vip41/biosoft/htslib/htslib-1.9/test/range.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/bedcov/bedcov.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/quickcheck/1.quickcheck.badeof.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/quickcheck/3.quickcheck.ok.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/quickcheck/2.quickcheck.badheader.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/quickcheck/4.quickcheck.ok.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/stat/11_target.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/stat/12_overlaps.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/mpileup.1.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/ce#5b.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/ce#unmap1.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/xx#triplet.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#ID.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/mpileup.3.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/mpileup.2.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#clip.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/ce#unmap.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/overlapIllumina.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/ce#unmap2.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/ce#large_seq.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#pad1.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#pad3.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/xx#minimal.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#pad2.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/c1#ID2.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/mpileup-E.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/mpileup/1read.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/dat/test_input_1_a.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/dat/test_input_1_c.bam
/home/vip41/biosoft/samtools/samtools-1.9/test/dat/test_input_1_b.bam
/home/vip41/biosoft/samtools/samtools-1.9/htslib-1.9/test/range.bam
/home/vip41/linux练习题/rmDuplicate/picard/paired/tmp.rmdup.bam
/home/vip41/linux练习题/rmDuplicate/picard/paired/tmp.sorted.bam
/home/vip41/linux练习题/rmDuplicate/picard/single/tmp.rmdup.bam
/home/vip41/linux练习题/rmDuplicate/picard/single/tmp.sorted.bam
/home/vip41/linux练习题/rmDuplicate/samtools/paired/tmp.rmdup.bam
/home/vip41/linux练习题/rmDuplicate/samtools/paired/tmp.sorted.bam
/home/vip41/linux练习题/rmDuplicate/samtools/single/tmp.rmdup.bam
/home/vip41/linux练习题/rmDuplicate/samtools/single/tmp.sorted.bam
- Use the samtools view command
to look at a .bam file:
vip41@VM-0-15-ubuntu:~/linux练习题/rmDuplicate/samtools/paired$ samtools view tmp.rmdup.bam
D00691:39:C7HGRANXX:7:1102:7445:18770 99 chr10 93614 60 126M = 93621 133 GGCACGTGGTGACCCCACTCATGGTAGCAGACACCAGGTGGTTCAGGTCACCATAGGTGGGTGTGGGCAGTTTTAGGGTCTTGGAACATATGTCATACAGAGCTTCGTTATCTATGCAAAAGGTCT BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:126 AS:i:126 XS:i:111 XA:Z:chr3,-197846908,126M,3;chr9,-141070974,126M,3;chr1,-808987,126M,4;chr4,+190904265,126M,4; MQ:i:60
D00691:39:C7HGRANXX:7:1102:7445:18770 147 chr10 93621 60 126M = 93614 -133 GGTGACCCCACTCATGGTAGCAGACACCAGGTGGTTCAGGTCACCATAGGTGGGTGTGGGCAGTTTTAGGGTCTTGGAACATATGTCATACAGAGCTTCGTTATCTATGCAAAAGGTCTCATCTGC FFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:111 XA:Z:chr9,+141070967,126M,3;chr3,+197846902,1S125M,3; MQ:i:60
D00691:39:C7HGRANXX:7:2302:14294:49245 323 chr10 94741 5 56H70M chr9 140136176 0 CACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCGTGGTAAGTGCC FFFFFFFFFFFFFF<FFBFFFFFFFFFBFFFFFFFFFFBFFFFFFFB7FFFBFFFFFFFFFFFFBFFFFF NM:i:7 MD:Z:5C4C32C4G3G2C8G5 AS:i:35 XS:i:30 SA:Z:chr9,140136365,-,69S57M,53,4; XA:Z:chrY,+19630445,56S70M,8;chrY,-20549264,70M56S,8;
D00691:39:C7HGRANXX:7:2201:12400:93441 387 chr10 94741 6 67H59M chr9 140136365 0 CACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCG FFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:6 MD:Z:5C4C32C4G3G2C3 AS:i:33 XS:i:28 SA:Z:chr9,140136365,-,58S68M,46,6; XA:Z:chrY,-20549275,59M67S,7;chrY,+19630445,67S59M,7;
D00691:39:C7HGRANXX:7:1313:10466:58327 353 chr10 94741 5 57H69M chr19 6502177 0 CACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCGTGGTAAGTGC FFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFF NM:i:7 MD:Z:5C4C32C4G3G2C8G4 AS:i:34 XS:i:29 SA:Z:chr9,140136365,-,68S58M,52,4; XA:Z:chrY,-20549265,69M57S,8;chrY,+19630445,57S69M,8;
D00691:39:C7HGRANXX:7:1301:9495:12136 97 chr10 94741 4 51S75M chr19 6502177 0 CCATGGTGCCGGGCTCCAAGTCCACGAGCACGGCGCGGGGCACATACTTGCCACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCGTGGTAAGTGCCAGTGG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFF<<BFFFFFFF//BFFFFFFFFFFFFFFFFFFFFFBFFBFF NM:i:8 MD:Z:5C4C32C4G3G2C8G7C2 AS:i:37 XS:i:32 SA:Z:chr9,140136365,-,74S52M,4,3; XA:Z:chrY,+19630445,51S75M,9;chrY,-20549259,75M51S,9; MQ:i:25
D00691:39:C7HGRANXX:7:2201:12400:93441 371 chr10 94741 6 64H61M chr9 140136365 0 CACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCGTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:6 MD:Z:5C4C32C4G3G2C5 AS:i:33 XS:i:28 SA:Z:chr9,140136365,+,60S65M,49,5; XA:Z:chrY,+20549273,61M64S,7;chrY,-19630445,64S61M,7;
D00691:39:C7HGRANXX:7:2105:4071:100631 433 chr10 94741 8 76H50M chr9 140136947 0 CACCGGTGGCTTCGTTGTAGTACACGTTGATGCGCTCCAGCTGGAGGTCG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:4 MD:Z:5C4C32C4G1 AS:i:33 XS:i:28 SA:Z:chr9,140136365,+,49S77M,54,6; XA:Z:chrY,+20549284,50M76S,5;chrY,-19630445,76S50M,5;
D00691:39:C7HGRANXX:7:2308:13827:14580 97 chr10 94765 0 48M77S chr19 6502177 0 CGTTGATGCGCTCCAGCTGGAGGTCGCTATCTCCGTGGTAAGTGCCAGTGGGATCAATGCCATGCTCGTCGCTGATTACCTCCCAGAACTTGGCGCCAATCTGGTTGCCGCACTGCCCAGCCTGC BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFF<F/BFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFF NM:i:5 MD:Z:19C4G3G2C8G7 AS:i:23 XS:i:23 XA:Z:chr4,+190905620,48M77S,5;chr9,-141069855,77S48M,5;chr18,+49150,48M77S,5; MQ:i:25
D00691:39:C7HGRANXX:7:1209:11322:53678 99 chr10 95877 48 126M = 96056 305 CCAGGCTTTTGGATTACCCAAACTGAGGAGTTATTTCTTCTGGTAAACATTTTTCAGATGGGGTGGGGAATGTCTCGATCTAACCAGTGAAGGTGTCAGTAAGCATTAGCAAATATTTGAATCTCC BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFF NM:i:2 MD:Z:0G71C53 AS:i:120 XS:i:115 XA:Z:chr16,-90159613,126M,3;chr4,+190906717,126M,5;chr18,+50253,126M,5;chrY,-9967205,126M,5;chr9,-141068669,126M,5; MQ:i:60
D00691:39:C7HGRANXX:7:1209:11322:53678 147 chr10 96056 60 126M = 95877 -305 CCTCAATTTTGGACAGGTTTAACTGGAGAAGGAGAAAATTGCTGGCCATTTGAGTCATGTCAGGCACAAAGCTCACAGGGCTGAGTCACCTGTTTTAGTGTCTTGAAAAGATTTACCCCTATAAAC FFFFFFFFBFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:111 XA:Z:chr18,-50432,126M,3;chr9,+141068490,126M,4;chrY,+9967026,125M1S,4;chr16,+90159438,4S117M5S,3; MQ:i:48
D00691:39:C7HGRANXX:7:1110:2613:30476 99 chr10 116605 60 122M = 116651 172 CCCAGTCTCTACTAAAAATACAATAATCAGCCAGGCATGGTGGCGCAGACCTGTAATCCCAGCTACTCAAGAGGCTGAGGACGAATTACTTGAACACAGGAGGTGGAGGCTGCAGTCAGCCG BBBBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFBBFFFFF NM:i:0 MD:Z:122 AS:i:122 XS:i:102 XA:Z:chr9,-141048001,122M,4; MQ:i:60
D00691:39:C7HGRANXX:7:1110:2613:30476 147 chr10 116651 60 126M = 116605 -172 AGACCTGTAATCCCAGCTACTCAAGAGGCTGAGGACGAATTACTTGAACACAGGAGGTGGAGGCTGCAGTCAGCCGAGATCTCCACTGCGCCACTGCACTCCAGCATGGGAGACAGAGCAGAACCC B<FF/BFFF</FFFFFFBFFBFF<FFBFFFFFFFFFFFBFFFFFFFFFFFFF<FFFB<BFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBF<FFFFFFFFFBFFFFFFFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:111 XA:Z:chr9,+141047951,126M,3;chr18,-70028,126M,4; MQ:i:60
D00691:39:C7HGRANXX:7:1110:5694:99774 99 chr10 117523 60 39M1I17M = 117654 257 TGCACTCCAGCCTGAGAAGAAAAAGTGAAAATCTGTCTCAAAAAAAAAAGTGGGGAG BBBBB<FFFFFFFFFFFFFFFFFFFFF</</BFFFBBFF<<FFFB<BFFF<FFFFBF NM:i:1 MD:Z:56 AS:i:49 XS:i:29 MQ:i:60
D00691:39:C7HGRANXX:7:1110:5694:99774 147 chr10 117654 60 126M = 117523 -257 CCCAGGAGTTAAAGACAAACCTGAGGGACATAAAGATCCTGCTTAAATTAGCAGTGCATGGTGGCTGGTGCCTATAGTCCAAGCTACTTGGGAGGCTGAGGCAGGAGGATTGCTGGAGCCCAGGAG BFBBB/BF/BFFFFFFFFFFFFBFFFFFFFFFFFFFBFFB/FBFFBF<FFFFFBBFFFFFFFBFF<B<FB/FFFFFBFFFFFFFFFBBFFFFFFB<FFFFFFFFFFFFFFFFBFBBFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:101 XA:Z:chr18,-71035,126M,5; MQ:i:60
D00691:39:C7HGRANXX:7:1108:3465:47458 147 chr10 119991 5 38S20M66S = 119991 -20 TACAGCAGCCTGGATGGATTTGATCCACTCATCTTTCTCCTCCTGGGTGGGTGCTGAGATTCGGTACACCATGTGGTTCCCCTCCACTACCCGGCCGTCAGCCTCTGTTTTGCAGGCTTTGATG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:20 AS:i:20 XS:i:19 XA:Z:chr12,-54786996,31S19M74S,0; MQ:i:5
D00691:39:C7HGRANXX:7:1108:3465:47458 99 chr10 119991 5 67S20M39S = 119991 20 GCCAACATCTCATAGAAGGGGTCCACACTTACAGCAGCCTGGATGGATTTGATCCACTCATCTTTCTCCTCCTGGGTGGGTGCTGAGATTCGGTACACCATGTGGTTCCCCTCCACTACCCGGCCG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF< NM:i:0 MD:Z:20 AS:i:20 XS:i:19 XA:Z:chr12,+54786996,60S19M47S,0; MQ:i:5
D00691:39:C7HGRANXX:7:1110:18049:20672 99 chr10 122194 60 125M = 122356 288 CCACGTTCCTCCTCCTCTCCGGGGACCAGGGTCTCTCCCCAGAAACAAAATCGCATCGGTAACCGGCATCTTGTCCTGTGCTGGGGGTGAGCCGCCCAAGCCTCCATGAAGGGACGCTCGTACAA BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFFFFFFFFFFFFFFFF NM:i:0 MD:Z:125 AS:i:125 XS:i:105 XA:Z:chr18,+75600,125M,4; MQ:i:60
D00691:39:C7HGRANXX:7:1110:18049:20672 147 chr10 122356 60 126M = 122194 -288 GCCGCGCTCGCTCCGCTGCACTCACAGCGGCGGCAGGAAGCCTTTTTCTCACTTTCTCCCCGGCGGCCCCAGGTGTCCCGGAGCGTCTCCCTGTCCTCACAGCGGACGTGGCCCCAGGTGTCCCGA <FFFFFFBFBFBFFFFFFFFFFFFFFFBFFFFFFFFFFFF<7/FFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:106 XA:Z:chr18,-75762,126M,4; MQ:i:60
D00691:39:C7HGRANXX:7:1210:10697:93407 83 chr10 147812 5 23M100S = 147812 -23 GGCAACAGAGGAGGGAAAGGAGATGATTTTCCCTGGTGGAAACGGATGCAAAAGGGAGAATTTCCTTGGGACGACAAGGACTTCCGGAGCCTGGCTGTTTTGGGGGCTGGTGTGGCTGCGGGG BFBFFFFFFFFFFFFFFFFF<FFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBBNM:i:1 MD:Z:2T20 AS:i:20 XS:i:0 MQ:i:5
D00691:39:C7HGRANXX:7:1210:10697:93407 163 chr10 147812 5 23M100S = 147812 23 GGCAACAGAGGAGGGAAAGGAGATGATTTTCCCTGGTGGAAACGGATGCAAAAGGGAGAATTTCCTTGGGACGACAAGGACTTCCGGAGCCTGGCTGTTTTGGGGGCTGGTGTGGCTGCGGGG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFF<BFFFFFFFFFFFFFFFFFFFFFFFNM:i:1 MD:Z:2T20 AS:i:20 XS:i:0 MQ:i:5
D00691:39:C7HGRANXX:7:1209:19089:96642 83 chr10 147815 0 24S20M81S = 147815 -20 GCTGGCCCTGGAGGAGATGGAGGCAACAGAGGAGGGAAAGGAGATGATTTTCCCTGGTGGAAACGGATGCAAAAGGGAGAATTTCCTTGGGACGACAAGGACTTCCGGAGCCTGGCTGTTTTGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:20 AS:i:20 XS:i:19 XA:Z:chrY,+13928508,79S19M27S,0;chr1,-180946205,28S19M78S,0;chr6,+40594269,78S19M28S,0; MQ:i:6
D00691:39:C7HGRANXX:7:1209:19089:96642 163 chr10 147815 6 56S20M50S = 147815 20 ATTCAGTACCTCCAAAAAAAGAACCAAAAAATGCTGGCCCTGGAGGAGATGGAGGCAACAGAGGAGGGAAAGGAGATGATTTTCCCTGGTGGAAACGGATGCAAAAGGGAGAATTTCCTTGGGACG BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:20 AS:i:20 XS:i:0 MQ:i:0
D00691:39:C7HGRANXX:7:2101:5430:95369 99 chr10 153722 60 126M = 153784 188 TAAAGACTTCTAAAATCTTGGGATGCAGCTGAGATTGCTATTTAAGTGGACAGAGCATTATAGGCCAGCTCCTTTGTGAGCTTCATACCCTACATGTGTGCTACTTTCAGCATGACCTGCCTCTAC BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:126 AS:i:126 XS:i:0 MQ:i:60
D00691:39:C7HGRANXX:7:2101:5430:95369 147 chr10 153784 60 126M = 153722 -188 GGCCAGCTCCTTTGTGAGCTTCATACCCTACATGTGTGCTACTTTCAGCATGACCTGCCTCTACTTTTTGCAGGGGGACTACAATGTGGGTTTGGATGGCAAAAGCTCCTGGAGTGATTCCCTAGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB NM:i:0 MD:Z:126 AS:i:126 XS:i:20 MQ:i:60
D00691:39:C7HGRANXX:7:2310:19944:26870 99 chr10 162451 0 39S19M17S = 162451 19 GACACATCAACCATAGCCCAAAACAACAGGCAGCCCGGCAGCTGTTGCCCTCACTGTTGCCTATTACCTTGGGGG BBBBBFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFF/FFFB/FB<FFFFFFFFFFFFBBFF NM:i:0 MD:Z:19 AS:i:19 XS:i:19 XA:Z:chr2,-217642450,11S19M45S,0; MQ:i:0
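The command that produced the .sam file is the bowtie2 alignment command. The hg38 reference genome can be downloaded as follows: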
mkdir -p /home/vip41/reference/geneme
cd /home/vip41/reference/geneme
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gzip -d hg38.fa.gz
# gzip -d already leaves a single hg38.fa, so no cat step is needed to merge per-chromosome files
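The producing command is normally recorded in the file itself: the @PG lines of a SAM/BAM header carry a CL: field with the command line of the program that wrote it, and samtools view -H prints the header. The sketch below works on a made-up header, so the bowtie2 version and command line shown are hypothetical; on the real data the input would come from samtools view -H tmp.sorted.bam.

```shell
# Made-up header text (assumption); real input: samtools view -H tmp.sorted.bam
header=$(mktemp)
printf '@HD\tVN:1.0\tSO:coordinate\n' > "$header"
printf '@SQ\tSN:chr1\tLN:248956422\n' >> "$header"
printf '@PG\tID:bowtie2\tPN:bowtie2\tVN:2.2.9\tCL:"bowtie2-align-s -x hg38 -U tmp.fq"\n' >> "$header"

# Keep the @PG line, put each TAB-separated field on its own line,
# then keep only the command-line field.
grep '^@PG' "$header" | tr '\t' '\n' | grep '^CL:'
```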
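13. Based on the command above, find out how many chromosomes there are in the reference genome I used, /home/jianmingzeng/reference/index/bowtie/hg38.
Command used for this exercise: samtools view -H single/tmp.sorted.bam |awk '{print $2}'|cut -c4-9|sort -n|uniq -c|grep -v '_'
How awk is used:
awk 'pattern { action }' filenames
pattern is what awk looks for in the data, and action is the series of commands run when a match is found.
awk works like this: it reads one record delimited by the '\n' newline character, splits the record into fields using the field separator, and fills the field variables: $0 is the whole record, $1 is the first field, and $n is the nth field. The default field separator is a space or a tab.
grep -v: invert the match, printing every line that does not match the pattern
References: awk: https://www.cnblogs.com/ggjucheng/archive/2013/01/13/2858470.html
uniq: http://man.linuxde.net/uniq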
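A few one-line illustrations of the points above, as a sketch on made-up input strings:

```shell
# Default field separator: runs of spaces or tabs. $2 is the second field.
printf 'alpha beta gamma\n' | awk '{print $2}'

# -F sets the field separator; NF holds the number of fields in the record.
printf 'a:b:c\n' | awk -F: '{print $1, NF}'

# A pattern in front of the braces restricts the action to matching records.
printf '@SQ\tSN:chr1\n@PG\tID:bowtie2\n' | awk '$1 == "@SQ" {print $2}'

# grep -v keeps only the lines that do NOT match the pattern.
printf 'chr1\nchr1_random\n' | grep -v '_'
```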
vip41@VM-0-15-ubuntu:~/biosoft/data1/rmDuplicate/samtools$ samtools view -H single/tmp.sorted.bam |awk '{print $2}'|cut -c4-9|sort -n|uniq -c|grep -v '_'
1 bowtie
1 chr1
1 chr10
1 chr11
1 chr12
1 chr13
1 chr14
1 chr15
1 chr16
1 chr17
1 chr18
1 chr19
1 chr2
1 chr20
1 chr21
1 chr22
1 chr3
1 chr4
1 chr5
1 chr6
1 chr7
1 chr8
1 chr9
1 chrM
1 chrX
1 chrY
1 1.0
vip41@VM-0-15-ubuntu:~/biosoft/data1/rmDuplicate/samtools$ samtools view -H single/tmp.sorted.bam |awk '{print $2}'|cut -c4-9|sort -n|uniq -c|grep -v '_'|wc
27 54 365
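Excluding the first and last lines (bowtie and 1.0, which come from the @PG and @HD header lines rather than from chromosomes), there are 25 chromosomes in total.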
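The two extra lines can be avoided by keeping only the @SQ header lines before extracting names. The sketch below shows this variant on a made-up header (the sequence lengths are hypothetical); on the real file the input would come from samtools view -H single/tmp.sorted.bam.

```shell
# Made-up header (assumption): 2 chromosomes, 1 unplaced contig, plus @HD and @PG.
header=$(mktemp)
printf '@HD\tVN:1.0\tSO:coordinate\n' > "$header"
printf '@SQ\tSN:chr1\tLN:100\n@SQ\tSN:chr2\tLN:100\n@SQ\tSN:chr1_KI270706v1_random\tLN:100\n' >> "$header"
printf '@PG\tID:bowtie2\tPN:bowtie2\n' >> "$header"

# Keep @SQ lines, take the SN: field, strip the tag,
# drop names containing '_' (contigs), and count what remains.
grep '^@SQ' "$header" | cut -f2 | sed 's/^SN://' | grep -v '_' | wc -l
```

Stripping the SN: tag with sed instead of cut -c4-9 also avoids truncating longer sequence names.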
14. In the BAM file above, the second column contains only the two numbers 0 and 16. Count how many times each occurs using cut/sort/uniq and similar commands.
vip41@VM-0-15-ubuntu:~/linux练习题/rmDuplicate/samtools/single$ samtools view tmp.sorted.bam |cut -f2|sort|uniq -c
29 0
24 16
15. Now open the BAM file under the rmDuplicate/samtools/paired folder, look at the second column again, and count the values.
vip41@VM-0-15-ubuntu:~/linux练习题/rmDuplicate/samtools/paired$ samtools view tmp.sorted.bam | cut -f2|sort -n |uniq -c
3 83
2 97
9 99
8 147
3 163
1 323
1 353
1 371
1 387
1 433
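The second column is the SAM FLAG, a sum of bit values, which is why paired-end data shows many more distinct numbers than the 0 and 16 seen for single-end data. The samtools flags subcommand listed in the help output above explains a flag; the sketch below does the same decoding in plain shell. The decode_flag function and its short bit names are hypothetical, with the bit meanings taken from the SAM specification.

```shell
# Hypothetical helper: print the names of the bits set in a SAM FLAG value.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}     # number before the colon
        name=${pair#*:}     # label after the colon
        if [ $((flag & bit)) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$flag ${out:-none}"
}

for f in 0 16 99 147; do
    decode_flag "$f"
done
```

For example, 99 = 1 + 2 + 32 + 64, and 147 = 1 + 2 + 16 + 128 is its properly paired mate on the reverse strand.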
16. Download http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip, unzip it, and look at the folder structure inside. The file is 2.3M, so pay attention to the download time and speed.
vip41@VM-0-15-ubuntu:~/linux练习题/rmDuplicate/samtools/paired$ wget http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip
--2018-12-14 23:25:52-- http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip
Resolving www.biotrainee.com (www.biotrainee.com)... 123.206.72.184
Connecting to www.biotrainee.com (www.biotrainee.com)|123.206.72.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2391084 (2.3M) [application/zip]
Saving to: ‘sickle-results.zip’
sickle-results.zi 100%[===========>] 2.28M 105KB/s in 24s
2018-12-14 23:26:17 (97.3 KB/s) - ‘sickle-results.zip’ saved [2391084/2391084]
vip41@VM-0-15-ubuntu:~/linux练习题/rmDuplicate/samtools/paired$ mv sickle-results.zip ~/linux练习题/
vip41@VM-0-15-ubuntu:~/linux练习题$ unzip sickle-results.zip
Archive: sickle-results.zip
creating: sickle-results/
inflating: sickle-results/command.txt
inflating: sickle-results/single_tmp_fastqc.html
inflating: sickle-results/single_tmp_fastqc.zip
inflating: sickle-results/test1_fastqc.html
inflating: sickle-results/test1_fastqc.zip
inflating: sickle-results/test2_fastqc.html
inflating: sickle-results/test2_fastqc.zip
inflating: sickle-results/trimmed_output_file1_fastqc.html
inflating: sickle-results/trimmed_output_file1_fastqc.zip
inflating: sickle-results/trimmed_output_file2_fastqc.html
inflating: sickle-results/trimmed_output_file2_fastqc.zip
vip41@VM-0-15-ubuntu:~/linux练习题$ cd sickle-results/
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results$ tree
.
├── command.txt
├── single_tmp_fastqc.html
├── single_tmp_fastqc.zip
├── test1_fastqc.html
├── test1_fastqc.zip
├── test2_fastqc.html
├── test2_fastqc.zip
├── trimmed_output_file1_fastqc.html
├── trimmed_output_file1_fastqc.zip
├── trimmed_output_file2_fastqc.html
└── trimmed_output_file2_fastqc.zip
0 directories, 11 files
17. Unzip sickle-results/single_tmp_fastqc.zip, enter the extracted folder, find the fastqc_data.txt file, and count how many lines in that text file start with >>.
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results$ unzip single_tmp_fastqc.zip
Archive: single_tmp_fastqc.zip
creating: single_tmp_fastqc/
creating: single_tmp_fastqc/Icons/
creating: single_tmp_fastqc/Images/
inflating: single_tmp_fastqc/Icons/fastqc_icon.png
inflating: single_tmp_fastqc/Icons/warning.png
inflating: single_tmp_fastqc/Icons/error.png
inflating: single_tmp_fastqc/Icons/tick.png
inflating: single_tmp_fastqc/summary.txt
inflating: single_tmp_fastqc/Images/per_base_quality.png
inflating: single_tmp_fastqc/Images/per_tile_quality.png
inflating: single_tmp_fastqc/Images/per_sequence_quality.png
inflating: single_tmp_fastqc/Images/per_base_sequence_content.png
inflating: single_tmp_fastqc/Images/per_sequence_gc_content.png
inflating: single_tmp_fastqc/Images/per_base_n_content.png
inflating: single_tmp_fastqc/Images/sequence_length_distribution.png
inflating: single_tmp_fastqc/Images/duplication_levels.png
inflating: single_tmp_fastqc/Images/adapter_content.png
inflating: single_tmp_fastqc/Images/kmer_profiles.png
inflating: single_tmp_fastqc/fastqc_report.html
inflating: single_tmp_fastqc/fastqc_data.txt
inflating: single_tmp_fastqc/fastqc.fo
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results$ ls
command.txt test2_fastqc.html
single_tmp_fastqc test2_fastqc.zip
single_tmp_fastqc.html trimmed_output_file1_fastqc.html
single_tmp_fastqc.zip trimmed_output_file1_fastqc.zip
test1_fastqc.html trimmed_output_file2_fastqc.html
test1_fastqc.zip trimmed_output_file2_fastqc.zip
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results$ cd single_tmp_fastqc/
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results/single_tmp_fastqc$ ls
fastqc_data.txt fastqc_report.html Images
fastqc.fo Icons summary.txt
vip41@VM-0-15-ubuntu:~/linux练习题/sickle-results/single_tmp_fastqc$ cat fastqc_data.txt | grep '^>>'|wc -l
24
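In fastqc_data.txt each module opens with a >>Title line and closes with >>END_MODULE, so a count of lines starting with >> includes both kinds. The sketch below shows the difference on a made-up two-module excerpt (the content is hypothetical); grep -c counts matches directly, without the extra cat and wc.

```shell
# Made-up excerpt in the fastqc_data.txt layout (assumption).
data=$(mktemp)
printf '##FastQC\t0.11.5\n' > "$data"
printf '>>Basic Statistics\tpass\n#Measure\tValue\n>>END_MODULE\n' >> "$data"
printf '>>Per base sequence quality\tfail\n#Base\tMean\n>>END_MODULE\n' >> "$data"

grep -c '^>>' "$data"                         # every line starting with >>
grep '^>>' "$data" | grep -vc 'END_MODULE'    # module title lines only
```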
18. Download http://www.biotrainee.com/jmzeng/tmp/hg38.tss, look up the RefSeq database IDs of TP53, BRCA1, or other genes you are interested in at NCBI, then find which line of hg38.tss each one is on.
vip41@VM-0-15-ubuntu:~/linux练习题$ wget http://www.biotrainee.com/jmzeng/tmp/hg38.tss
--2018-12-14 23:36:06-- http://www.biotrainee.com/jmzeng/tmp/hg38.tss
Resolving www.biotrainee.com (www.biotrainee.com)... 123.206.72.184
Connecting to www.biotrainee.com (www.biotrainee.com)|123.206.72.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2625188 (2.5M)
Saving to: ‘hg38.tss’
hg38.tss 100%[===========>] 2.50M 253KB/s in 23s
2018-12-14 23:36:29 (114 KB/s) - ‘hg38.tss’ saved [2625188/2625188]
vip41@VM-0-15-ubuntu:~/linux练习题$ cat hg38.tss | grep -n "NM_001276696"
22181:NM_001276696 chr17 7685550 7689550 1
19. Parse hg38.tss and count the number of genes on each chromosome.
vip41@VM-0-15-ubuntu:~/linux练习题$ cat hg38.tss |cut -f2|sort|uniq -c|grep -v '_'
6050 chr1
2824 chr10
3449 chr11
2931 chr12
1122 chr13
1883 chr14
2168 chr15
2507 chr16
3309 chr17
873 chr18
3817 chr19
4042 chr2
1676 chr20
868 chr21
1274 chr22
3277 chr3
2250 chr4
2684 chr5
3029 chr6
2720 chr7
2069 chr8
2301 chr9
2 chrM
2553 chrX
414 chrY
20. Parse hg38.tss, count the entries starting with NM and NR, and understand what the NM and NR prefixes mean.
vip41@VM-0-15-ubuntu:~/linux练习题$ cat hg38.tss |awk '{print$1}'|cut -c1-2|sort|uniq -c
51064 NM
15954 NR
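On the meaning of the prefixes: in RefSeq accessions, NM_ marks an mRNA (protein-coding transcript) and NR_ marks a non-coding RNA. The counting pattern used in exercises 19 and 20 is the same cut | sort | uniq -c pipeline applied to different columns, sketched below on a made-up three-row file in the five-column layout seen in the grep output of exercise 18 (the IDs and coordinates are hypothetical).

```shell
# Made-up rows in the hg38.tss column layout (assumption).
tss=$(mktemp)
printf 'NM_000001\tchr1\t100\t4100\t1\n' > "$tss"
printf 'NR_000002\tchr1\t500\t4500\t1\n' >> "$tss"
printf 'NM_000003\tchr17\t900\t4900\t1\n' >> "$tss"

# Entries per chromosome (column 2).
cut -f2 "$tss" | sort | uniq -c

# Entries per two-letter accession prefix (column 1).
cut -f1 "$tss" | cut -c1-2 | sort | uniq -c
```

Reading the file directly with cut avoids the extra cat process used in the transcripts above.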