一、获得fastq文件
参考:bam文件转换fastq
PS: 我获得的公司汇报的数据是 .bam 格式的, 所以需要将bam文件转换fastq
我的bam文件在 /home/lchen/circ/ ;
讲bam转换为fastqc, 用 bedtools
我的fastq文件将存放在 /home/lchen/hisat2/rna_seq/
**语法**
bamToFastq -i unmapped.bam -fq unmapped.fastq
实例:
(rna) lchen@Vostro-5460~$ bamToFastq -i /home/lchen/circ/c1.bam -fq /home/lchen/hisat2/rna_seq/c1.fastq
参考:Y大宽
我的fastq文件在 /home/lchen/hisat2/rna_seq/
我的index在 /home/lchen/reference/hisat2_index/hg38/genome
比对后得到的sam文件会存放在 /home/lchen/hisat2/aligned/
构建索引:
建议直接下载hisat2官网的索引文件
下载地址:https://ccb.jhu.edu/software/hisat2/index.shtml
解压缩你会看到
(rna) lchen@Vostro-5460:~/reference/hisat2_index$ tar -zxvf hg38.tar.gz.gz
hg38/
hg38/make_hg38.sh
hg38/genome.8.ht2
hg38/genome.5.ht2
hg38/genome.7.ht2
hg38/genome.6.ht2
hg38/genome.4.ht2
hg38/genome.3.ht2
hg38/genome.1.ht2
hg38/genome.2.ht2
或用hisat2-build构建一个:
hisat2-build -p 20 hg38.fa hg38
-p 20 p为线程, 一般多少核去运行,这个看自己电脑的内存,
最后,将基因组、转录组、SNP建立索引:
**语法**
hisat2-build -p 4 genome.fa --snp genome.snp --ss genome.ss --exon genome.exon genome_snp_tran
**实例**
hisat2-build -p 4 hg38.fa --snp hg38.snp --ss hg38e.ss --exon hg38.exon hg38_snp_tran
单端数据比对的用法如下:
hisat -x hg19 -p 20 -U reads.fq -S align.sam
x是参考基因组索引文件的前缀, U为单端数据 , S为输出sam文件
实例
hisat2 -x /home/lchen/reference/hisat2_index/hg38/genome -p 20 -U /home/lchen/hisat2/rna_seq/c1.fastq -S /home/lchen/hisat2/aligned/alignc1.sam
50803150 reads; of these:
50803150 (100.00%) were unpaired; of these:
2477095 (4.88%) aligned 0 times
42642733 (83.94%) aligned exactly 1 time
5683322 (11.19%) aligned >1 times
95.12% overall alignment rate
(rna) lchen@Vostro-5460:~$ hisat2 -x /home/lchen/reference/hisat2_index/hg38/genome -p 20 -U /home/lchen/hisat2/rna_seq/c2.fastq -S /home/lchen/hisat2/aligned/alignc2.sam
51949025 reads; of these:
51949025 (100.00%) were unpaired; of these:
2615007 (5.03%) aligned 0 times
46117658 (88.77%) aligned exactly 1 time
3216360 (6.19%) aligned >1 times
94.97% overall alignment rate
hisat2 -x /home/lchen/reference/hisat2_index/hg38/genome -p 20 -U /home/lchen/hisat2/rna_seq/d1.fastq -S /home/lchen/hisat2/aligned/alignd1.sam
49316177 reads; of these:
49316177 (100.00%) were unpaired; of these:
2458874 (4.99%) aligned 0 times
42248554 (85.67%) aligned exactly 1 time
4608749 (9.35%) aligned >1 times
95.01% overall alignment rate
双端数据用法如下:
hisat -x hg19 -p 20 -1 R1.fq -2 R2.fq -S align.sam
参考: 酷睿_1991
接下来
用samtools将sam格式转位bam格式
samtools view -S /home/lchen/hisat2/aligned/alignd1.sam -b > d1.bam
**sort完之后, bam文件至少可以缩小2~3倍**
samtools sort -@ 5 -o /home/lchen/circ/sorted/d1_sorted.bam /home/lchen/circ/sorted/alignd1.sam
sort已有的bam文件:
samtools sort -@ 5 -o /home/lchen/circ/sorted/c1_sorted.bam /home/lchen/circ/c1.bam
merge所需要的bam:
samtools merge -@ 6 -c /home/lchen/circ/sorted/merged.bam /home/lchen/circ/sorted/*_sorted.bam
再次sort:
samtools sort -@ 5 -o /home/lchen/circ/sorted/sorted_merged.bam /home/lchen/circ/sorted/merged.bam
再次index一次:
samtools index /home/lchen/circ/sorted/sorted_merged.bam
remove duplicates:
samtools rmdup -s /home/lchen/circ/sorted/sorted_merged.bam /home/lchen/circ/sorted/sorted.rmdup.bam
最后一次index;
samtools index /home/lchen/circ/sorted/sorted.rmdup.bam
到了这一步基本上就处理的差不多了,可以进行call variation了,但是这里还有一步建立索引,这十分的重要!!!!
必须对bam文件进行排序后,才能进行index,否则会报错。建立索引后将产生后缀为.bai的文件,用于快速的随机处理。很多情况下需要有bai文件的存在,特别是显示序列比对情况下。建立索引很简单。
参考:十三而舍
flagstat :
samtools flagstat /home/lchen/circ/sorted/sorted_merged.bam > /home/lchen/circ/sorted/sorted.flagstat.txt
depth:
686 samtools depth /home/lchen/circ/sorted/sorted_merged.bam > /home/lchen/circ/sorted/sorted.depth
mpileup:
samtools mpileup -f /home/lchen/reference/samtools_index/hg38.fa /home/lchen/circ/sorted/sorted_merged.bam > /home/lchen/circ/sorted/sorted.mpileup.txt