Coverage Depth
覆盖深度 mapping depth
基因组被测序片段(短读 short reads)“覆盖”的强度有多大?
每一碱基的覆盖率是基因组碱基被测序的平均次数。 基因组的覆盖深度是通过与基因组匹配的所有短读的碱基数目除以该基因组的长度来计算的。 它通常表示为1X、2X、3X、...(1、2或3倍覆盖)。
此处通常被称为测序深度(sequencing depth)或者覆盖深度(depth of coverage)。
Depth 计算公式
原始测序数据的深度,Depth = (# sequenced bases) / (# bases of reference) 或者
比对上的数据的深度,Depth= (# bases of all mapped reads) / (# bases of reference).
覆盖范围 covered length
短读“覆盖”了基因组的多少区域?是否有些区域没有任何的短读覆盖?
覆盖范围是参考基因组被一定深度覆盖的碱基的百分比。例如,一个基因组的90%区域在1X深度覆盖,另外仍然有70%的区域被覆盖了5X深度。
此处通常称为覆盖度(genome covarage)或者覆盖范围(breadth of coverage)。
Coverage 计算公式
Coverage = (# area covered by mapped reads) / (# area of reference).
例子
对于全基因组
Depth = (6 * 28nt) / 112nt = 1.2 fold
Coverage = (46nt - 5nt) / 112nt = 36.6%
对于target区域
Depth = (6 * 28nt) / 46nt = 3.7 fold
Coverage = (46nt - 5nt) / 46nt = 89.1%
对于position
Depth = 6 fold
reads数
下机的fq文件中到底有多少条reads?可以使用软件readfq来统计:
git clone https://github.com/billzt/readfq
gcc -o kseq_fastq_base kseq_fastq_base.c -lz
kseq_fastq_base samp.fq
#FASTQ files could be gzipped.
#The statistic results would be printed on STDOUT.
功能单一,优化了速度:
It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!!
现在已有很多工具都会提供这个功能,间接或者直接计算测序深度和覆盖度,比如:samtools、sambamba、bedtools、qualimap、gatk等等,下面对各个工具的使用和结果做一点点汇整。
计算测序深度和覆盖度
samtools or sambamba
- depth可以得到每一个碱基位点上的测序深度:
samtools depth samp.bam > base_depth.txt
sambamba depth -t 10 -w 500 samp.bam > 500base_depth.txt
# -w breadth of the window, in bp, 计算窗口的平均深度
- stats对比对后的bam文件进行全面的统计
samtools stats --threads 10 samp.bam > samp_bamstat.txt
grep "^SN" samp_bamstat.txt
SN raw total sequences: 17870982
SN filtered sequences: 0
SN sequences: 17870982
SN is sorted: 0
SN 1st fragments: 8935491
SN last fragments: 8935491
SN reads mapped: 17870982
SN reads mapped and paired: 17870982 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 5208824 # proper-pair bit set
SN reads paired: 17870982 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 556465 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 1804969182 # ignores clipping
SN bases mapped: 1804969182 # ignores clipping
SN bases mapped (cigar): 1804969182 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 9941932 # from NM fields
SN error rate: 5.508089e-03 # mismatches / bases mapped (cigar)
SN average length: 101
SN maximum length: 101
SN average quality: 35.3
SN insert size average: 1330.1
SN insert size standard deviation: 1341.6
SN inward oriented pairs: 1418095
SN outward oriented pairs: 564408
SN pairs with other orientation: 6745980
SN pairs on different chromosomes: 207008
- 接下来可以使用
plot-bamstats
进行结果的可视化,plot-bamstats
是samtools中包含的一个软件,可以在samtools的根目录找到,查看使用:
plot-bamstats --help
About: Parses output of samtools stats (former bamcheck) and calls gnuplot to create graphs.
Usage: plot-bamstats [OPTIONS] file.bam.bc
plot-bamstats -p outdir/ file.bam.bc
Options:
-k, --keep-files Do not remove temporary files.
-l, --log-y Set the Y axis scale of the Insert Size plot to log 10.
-m, --merge Merge multiple bamstats files and output to STDOUT.
-p, --prefix <path> The output files prefix, add a slash to create new directory.
-r, --ref-stats <file.fa.gc> Optional reference stats file with expected GC content (created with -s).
-s, --do-ref-stats <file.fa> Calculate reference sequence GC for later use with -r
-t, --targets <file.tab> Restrict -s to the listed regions (tab-delimited chr,from,to. 1-based, inclusive)
-h, -?, --help This help message.
例子如下:
plot-bamstats -s /home/luna/Desktop/database/homo_bwa/hsa.fa > gc
plot-bamstats -r gc -p ./test samp_bamstat.txt
先计算基因组的gc含量分布,再给出一系列文件和图片,还有一个html文件:
ll test* gc
-rw-rw-r-- 1 luna luna 1910 Oct 14 16:04 gc
-rw-rw-r-- 1 luna luna 4266 Oct 14 16:04 test-acgt-cycles.gp
-rw-rw-r-- 1 luna luna 16707 Oct 14 16:04 test-acgt-cycles.png
-rw-rw-r-- 1 luna luna 429 Oct 14 16:04 test-coverage.gp
-rw-rw-r-- 1 luna luna 10497 Oct 14 16:04 test-coverage.png
-rw-rw-r-- 1 luna luna 4736 Oct 14 16:04 test-gc-content.gp
-rw-rw-r-- 1 luna luna 21895 Oct 14 16:04 test-gc-content.png
-rw-rw-r-- 1 luna luna 1026 Oct 14 16:04 test-gc-depth.gp
-rw-rw-r-- 1 luna luna 23312 Oct 14 16:04 test-gc-depth.png
-rw-rw-r-- 1 luna luna 4606 Oct 14 16:04 test.html
-rw-rw-r-- 1 luna luna 3528 Oct 14 16:04 test-indel-cycles.gp
-rw-rw-r-- 1 luna luna 28786 Oct 14 16:04 test-indel-cycles.png
-rw-rw-r-- 1 luna luna 1739 Oct 14 16:04 test-indel-dist.gp
-rw-rw-r-- 1 luna luna 29595 Oct 14 16:04 test-indel-dist.png
-rw-rw-r-- 1 luna luna 261710 Oct 14 16:04 test-insert-size.gp
-rw-rw-r-- 1 luna luna 16580 Oct 14 16:04 test-insert-size.png
-rw-rw-r-- 1 luna luna 5939 Oct 14 16:04 test-quals2.gp
-rw-rw-r-- 1 luna luna 21727 Oct 14 16:04 test-quals2.png
-rw-rw-r-- 1 luna luna 76931 Oct 14 16:04 test-quals3.gp
-rw-rw-r-- 1 luna luna 84648 Oct 14 16:04 test-quals3.png
-rw-rw-r-- 1 luna luna 2228 Oct 14 16:04 test-quals.gp
-rw-rw-r-- 1 luna luna 47463 Oct 14 16:04 test-quals-hm.gp
-rw-rw-r-- 1 luna luna 34730 Oct 14 16:04 test-quals-hm.png
-rw-rw-r-- 1 luna luna 15932 Oct 14 16:04 test-quals.png
可以打开html文件展示如下:
点击小图,也会在旁边展示大图,可以更清晰查看。
BAMStats
BAMStats是建立在Picard Java API上的简单软件工具,可以计算并以图形方式显示从SAM / BAM文件进行质量控制评估时得到的各种指标。
BAMStats为覆盖范围,起始位置,MAPQ值,映射的读取长度和编辑距离提供描述性统计信息。 如果提供了合适的bed或gtf文件,还可以为单个特征(例如外显子或诱饵区域)生成统计指标。
BAMStats需要分配2至6GB的内存(Java程序可以使用-Xmx来设置);另外SAM / BAM文件都可以作为输入,但是必须是排序后的文件。
- 下载
wget https://jaist.dl.sourceforge.net/project/bamstats/BAMStats-1.25.zip
unzip BAMStats-1.25.zip
rm BAMStats-1.25.zip
cd BAMStats-1.25 && ls
## 两个jar分别是命令行(CLI)和图形用户(GUI)
java -jar BAMStats-1.25/BAMStats-1.25.jar --help
USAGE:
======
Option Description
------ -----------
-d, --distances If selected, edit distance statistics
will also be displayed as a separate
table (optional).
-f, --features Feature file specifying specific
locations to calculate coverage
metrics for (optional)
-h, --help Generate help.
-i, --infile SAM or BAM infile (must be sorted).
-l, --lengths If selected, mapped read length
statistics will also be displayed as
a separate table (optional).
-m, --mapped If selected, coverage statistics, for
mapped regions only, will also be
displayed as a separate table
(optional).
-o, --outfile Outfile (optional).
-q, --qualities If selected, mapping quality (MAPQ)
statistics will also be displayed as
a separate table (optional).
-s, --starts If selected, start position statistics
will also be displayed as a separate
table (optional).
-v, --view View option for output format
(currently accepts 'simple' or
'html'; default, simple).
- 使用
sambamba sort -o smap.sort.bam -t 20 -p samp.bam
java -jar -Xmx30g BAMStats-1.25.jar -i samp.sort.bam -o BAMStats --view simple
cat BAMStats
ref | N | mean | median | sd | q1 | q3 | 2.50% | 97.50% | min | max |
---|---|---|---|---|---|---|---|---|---|---|
chr1 | 249,250,621 | 0.63 | 0 | 11.29 | 0 | 1 | 0 | 3 | 0 | 15051 |
chr2 | 243,199,373 | 0.63 | 0 | 1.43 | 0 | 1 | 0 | 3 | 0 | 706 |
chr3 | 198,022,430 | 0.64 | 0 | 1 | 0 | 1 | 0 | 3 | 0 | 102 |
chr4 | 191,154,276 | 0.58 | 0 | 1.13 | 0 | 1 | 0 | 3 | 0 | 527 |
... | ||||||||||
chr22 | 51,304,566 | 0.36 | 0 | 0.91 | 0 | 0 | 0 | 3 | 0 | 207 |
chrX | 155,270,560 | 0.59 | 0 | 1.29 | 0 | 1 | 0 | 3 | 0 | 832 |
chrY | 59,373,566 | 0.07 | 0 | 4.6 | 0 | 0 | 0 | 0 | 0 | 1859 |
chrMT | 16,571 | 26.14 | 22 | 15.47 | 16 | 32 | 7 | 69.7 | 0 | 98 |
改为html输出,可以看看:
java -jar -Xmx30g BAMStats-1.25.jar -d -l -m -q -s -i samp.sort.bam -o BAMStats2 --view html
会产生一个文件夹BAMStats2.data
和一个html文件BAMStats2
,进入查看:
cd BAMStats2.data && ls | wc -l
# 531
ll chr1_*
-rw-rw-r-- 1 luna luna 5056 Oct 14 16:43 chr1_Coverage_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 11480 Oct 14 16:43 chr1_Coverage_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 12513 Oct 14 16:43 chr1_Coverage_histogram.png
-rw-rw-r-- 1 luna luna 753 Oct 14 16:43 chr1_Coverage.html
-rw-rw-r-- 1 luna luna 4528 Oct 14 16:44 chr1_EditDistances_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 10601 Oct 14 16:44 chr1_EditDistances_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 11156 Oct 14 16:44 chr1_EditDistances_histogram.png
-rw-rw-r-- 1 luna luna 770 Oct 14 16:44 chr1_EditDistances.html
-rw-rw-r-- 1 luna luna 5852 Oct 14 16:43 chr1_MappedCoverage_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 12311 Oct 14 16:43 chr1_MappedCoverage_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 12688 Oct 14 16:43 chr1_MappedCoverage_histogram.png
-rw-rw-r-- 1 luna luna 784 Oct 14 16:43 chr1_MappedCoverage.html
-rw-rw-r-- 1 luna luna 5107 Oct 14 16:44 chr1_MappingQuality_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 11297 Oct 14 16:44 chr1_MappingQuality_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 11698 Oct 14 16:44 chr1_MappingQuality_histogram.png
-rw-rw-r-- 1 luna luna 771 Oct 14 16:44 chr1_MappingQuality.html
-rw-rw-r-- 1 luna luna 4979 Oct 14 16:44 chr1_ReadLengths_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_histogram.png
-rw-rw-r-- 1 luna luna 773 Oct 14 16:44 chr1_ReadLengths.html
-rw-rw-r-- 1 luna luna 463 Oct 14 16:43 chr1_StartPositions.html
查看html文件BAMStats2
,重命名,同时下载html和文件夹BAMStats2.data
到本地,放在同一个目录下,以后使用浏览器打开:
mv BAMStats2 BAMStats2.html
tar -cf Stats2.tar BAMStats2*
sz Stats2.tar
# 本地解压
打开BAMStats2.html
,部分结果展示:
也可以到文件夹BAMStats2.data
中查看每一条染色体:
GUI界面也可以看下图:
GATK DepthOfCoverage
gatk3和gatk4中都有这个工具,DepthOfCoverage,本处使用gatk 4.1.8.1:
which gatk
/home/luna/Desktop/Software/gatk/gatk-4.1.8.1/gatk
# gatk4
gatk DepthOfCoverage -R /home/luna/Desktop/database/homo_bwa/hg19.fa -I samp.sort.bam -O gatk -L chr1 --summary-coverage-threshold 1 --summary-coverage-threshold 4
# gatk3
java -jar /home/luna/Desktop/Software/gatk/GenomeAnalysisTK.jar -T DepthOfCoverage -I samp.sort.bam -R /home/luna/Desktop/database/homo_bwa/hg19.fa --out gatk3 --summaryCoverageThreshold 1 --summaryCoverageThreshold 4 -L chr1
# -L 后面可以接单个染色体的名称,也可以多次使用,或者bed文件包含基因组的多个区域作为限定
# summary Coverage Threshold,可以随意设置数值(N),表示%_above_N,即深度超过N的coverage of breadth.
运行结束会生成多个文件,主要是查看一下summary文件:
no suffix: per locus coverage
_summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases
_statistics: coverage histograms (# locus with X coverage), aggregated over all bases
_interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval
_interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples
_gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene
_gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples
_cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases
_cumulative_coverage_proportions: proprotions of loci with >= X coverage, aggregated over all bases
## 查看
cat gatk.sample_summary
sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_1,%_bases_above_4
SRX247249,155122543,0.69,2,1,1,41.8,2.2
Total,155122543,0.69,N/A,N/A,N/A,N/A,N/A
BamStats
此处有多个bamstats的程序,所以作为区分,列出来。
Biopet BamStats
Biopet BamStats is a package that contains tools to generate stats from a BAM file, merge those stats for multiple samples, and validate the generated stats files.
Mode - Generate
Generate reports clipping stats, flag stats, insert size and mapping quality on a BAM file.
The output of the JSON file is organized in a sample - library - readgroup tree structure. If readgroups in the BAM file are not annotated with sample (SM) and library (LB) tags an error will be thrown. This can be fixed by using samtools addreplacerg or picard AddOrReplaceReadGroups.
Mode - Merge
This module will merge bamstats files together and keep the sample/library/readgroup structure. Values for the same readgroups will be added. It will also validate the resulting file.
Mode - Validate
Validates a BamStats file. If aggregation values can not be regenerated the file is considered corrupt. This should only happen when the file has been manually edited.
website: https://biopet.github.io/bamstats/1.0.1/index.html
wget https://github.com/biopet/bamstats/releases/download/v1.0.1/BamStats-assembly-1.0.1.jar
mv BamStats-assembly-1.0.1.jar BamStats-biopet.jar
mkdir biopet
java -jar /home/luna/Desktop/Software/StatsBAM/BamStats-biopet.jar generate -R /home/luna/Desktop/database/homo_bwa/hg19.fa -o biopet -b samp.sort.bam --tsvOutputs
# 会在biopet生成文件
3prime_clipping.histogram.tsv
5prime_clipping.stats.tsv
clipping.histogram.tsv
insertsize.histogram.tsv
left_clipping.stats.tsv
right_clipping.histogram.tsv
3prime_clipping.stats.tsv
bamstats.json
clipping.stats.tsv
insertsize.stats.tsv
mappingQuality.histogram.tsv
right_clipping.stats.tsv
5prime_clipping.histogram.tsv
bamstats.summary.json
flagstats.tsv
left_clipping.histogram.tsv
mappingQuality.stats.tsv
结果很详细,但是对于json文件我没有找到合适的后续处理,这里仅记录。
guigolab bamstats
wget -O - https://github.com/guigolab/bamstats/releases/download/v0.3.5/bamstats-v0.3.5-linux-x86_64.tar.gz | tar xz -C ./ bamstats
/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -h
bamstats - compute mapping statistics
Usage:
bamstats [flags]
Flags:
-a, --annotaion string element annotation file
-c, --cpu int number of cpus to be used (default 80)
-h, --help help for bamstats
-i, --input string input file (required)
--loglevel string logging level (default "warn")
--max-buf int maximum number of buffered records (default 1000000)
-o, --output string output file (default "-")
-n, --reads int number of records to process (default -1)
-u, --uniq output genomic coverage statistics for uniqely mapped reads too
--version version for bamstats
Bamstats can currently compute the following mapping statistics:
general
genome coverage
RNA-seq
使用,具体上github查看,https://github.com/guigolab/bamstats/blob/master/scripts/
/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -i samp.sort.bam --cpu 20 --uniq --output guigolab.bamstats
bamdst
Bamdst is a lightweight tool to stat the depth coverage of target regions of bam file(s).
Bam file(s) should be properly sorted, and the probe file (bed file) and the output dir must be specified in the first place.
下载安装
git clone https://github.com/shiquan/bamdst.git
cd bamdst/
make
使用
在使用该软件之前,我们首先制作一个bed文件:
cat chr1.region.bed
chr1 64909787 160398769
## PS中间空白是TAB,不是空格
查看使用说明
/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -h
bamdst version: 1.0.9
USAGE : bamdst [OPTION] -p <probe.bed> -o <output_dir> [in1.bam [in2.bam ... ]]
or : bamdst [OPTION] -p <probe.bed> -o <output_dir> -
Option -o and -p are mandatory:
-o, --outdir output dir
-p, --bed probe or target regions file, the region file will
be merged before calculate depths
测试
mkdir ./bamdst
/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -p chr1.region.bed -o ./bamdst samp.sort.bam
cd ./bamdst
ls
# 列出生成的文件
-coverage.report This file contains all the coverage information of target and flank region, and reads stat information of the input file(s).
-cumu.plot Depth distrbution for plot.
-insert.plot Inferred insert size distribution for plot.
-chromosome.report Depth and coverage information of each chromosome.
-region.tsv.gz For each region in probe file (in.bed), average depth, median depth and coverage of these regions will be listed in the file.
-depth.tsv.gz For each position in the probe file(in.bed), three kinds depth value will be calculated, including raw depth, rmdup depth and the coverage depth.
-uncover.bed This bed file contains the bad covered or uncovered region in the input bam file(s) against the probe file.
查看coverage.report 和 chromosome.report
cat coverage.report
## The file was created by bamdst
## Version : 1.0.9
## Files : samp.sort.bam
[Total] Raw Reads (All reads) 17870982
[Total] QC Fail reads 0
[Total] Raw Data(Mb) 1804.97
[Total] Paired Reads 17870982
[Total] Mapped Reads 17870982
[Total] Fraction of Mapped Reads 100.00%
[Total] Mapped Data(Mb) 1804.97
[Total] Fraction of Mapped Data(Mb) 100.00%
[Total] Properly paired 5208824
[Total] Fraction of Properly paired 29.15%
[Total] Read and mate paired 17870982
[Total] Fraction of Read and mate paired 100.00%
[Total] Singletons 0
[Total] Read and mate map to diff chr 414016
[Total] Read1 8935491
[Total] Read2 8935491
[Total] Read1(rmdup) 8935491
[Total] Read2(rmdup) 8935491
[Total] forward strand reads 8940264
[Total] backward strand reads 8930718
[Total] PCR duplicate reads 0
[Total] Fraction of PCR duplicate reads 0.00%
[Total] Map quality cutoff value 20
[Total] MapQuality above cutoff reads 17045397
[Total] Fraction of MapQ reads in all reads 95.38%
[Total] Fraction of MapQ reads in mapped reads 95.38%
[Target] Target Reads 548096
[Target] Fraction of Target Reads in all reads 3.07%
[Target] Fraction of Target Reads in mapped reads 3.07%
[Target] Target Data(Mb) 55.36
[Target] Target Data Rmdup(Mb) 48.10
[Target] Fraction of Target Data in all data 3.07%
[Target] Fraction of Target Data in mapped data 3.07%
[Target] Len of region 95488982
[Target] Average depth 0.58
[Target] Average depth(rmdup) 0.50
[Target] Coverage (>0x) 32.72%
[Target] Coverage (>=4x) 1.87%
[Target] Coverage (>=10x) 0.09%
[Target] Coverage (>=30x) 0.01%
[Target] Coverage (>=100x) 0.00%
[Target] Target Region Count 1
[Target] Region covered > 0x 0
[Target] Fraction Region covered > 0x 0.00%
[Target] Fraction Region covered >= 4x 0.00%
[Target] Fraction Region covered >= 10x 0.00%
[Target] Fraction Region covered >= 30x 0.00%
[Target] Fraction Region covered >= 100x 0.00%
[flank] flank size 200
[flank] Len of region (not include target region) 200
[flank] Average depth 1.09
[flank] flank Reads 4
[flank] Fraction of flank Reads in all reads 0.00%
[flank] Fraction of flank Reads in mapped reads 0.00%
[flank] flank Data(Mb) 0.00
[flank] Fraction of flank Data in all data 0.00%
[flank] Fraction of flank Data in mapped data 0.00%
[flank] Coverage (>0x) 55.50%
[flank] Coverage (>=4x) 2.00%
[flank] Coverage (>=10x) 0.00%
[flank] Coverage (>=30x) 0.00%
[flank] Coverage (>=100x) 0.00%
cat chromosomes.report
#Chromosome | DATA(%) | Avg depth | Median | Coverage% | Cov 4x % | Cov 10x % | Cov 30x % | Cov 100x % |
---|---|---|---|---|---|---|---|---|
chr1 | 100 | 0.58 | 0 | 32.72 | 1.87 | 0.09 | 0.01 | 0 |
Qualimap
Qualimap是功能比较全的一款质控软件,提供GUI界面和命令行界面,可以对bam文件,RNA-seq,Counts数据质控,也支持比对数据,counts数据和表观数据的比较和聚类。
Qualimap:http://qualimap.bioinfo.cipf.es/doc_html/index.html
命令行文档:http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#command-line
下载安装
wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip
unzip qualimap_v2.2.1.zip
正常使用qualimap,需要安装部分R包:
# 安装相关的R包
# NOISeq,Repitools, Rsamtools, GenomicFeatures, rtracklayer
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("NOISeq", version = "3.8")
BiocManager::install("Repitools", version = "3.8")
BiocManager::install("Rsamtools", version = "3.8")
BiocManager::install("GenomicFeatures", version = "3.8")
BiocManager::install("rtracklayer", version = "3.8")
使用
/home/luna/Desktop/Software/qualimap/qualimap bamqc --java-mem-size=40G -c -nt 30 -outdir ./ -outformat PDF -os -outfile samp.bamqc.pdf -bam samp.sort.bam -gd hg19 -nr 2000
mosdepth
Mosdepth是一种新的命令行工具,可以快速计算全基因组测序覆盖率。输入BAM或CRAM文件,计算在基因组中每个核苷酸位置或基因组区域的深度;也可将基因组区域指定为被BED文件覆盖的指定区域,或者指定为CNV calling所需的固定大小窗口。测试来看,Mosdepth比大多数现有工具更快。
下载安装
## github上有编译好的最新版的二进制文件,可以下载使用
wget https://github.com/brentp/mosdepth/releases/download/v0.3.1/mosdepth
chmod +x mosdepth
./mosdepth --help
mosdepth 0.3.1
Usage: mosdepth [options] <prefix> <BAM-or-CRAM>
Arguments:
<prefix> outputs: `{prefix}.mosdepth.dist.txt`
`{prefix}.mosdepth.summary.txt`
`{prefix}.per-base.bed.gz` (unless -n/--no-per-base is specified)
`{prefix}.regions.bed.gz` (if --by is specified)
`{prefix}.quantized.bed.gz` (if --quantize is specified)
`{prefix}.thresholds.bed.gz` (if --thresholds is specified)
<BAM-or-CRAM> the alignment file for which to calculate depth.
Common Options:
-t --threads <threads> number of BAM decompression threads [default: 0]
-c --chrom <chrom> chromosome to restrict depth calculation.
-b --by <bed|window> optional BED file or (integer) window-sizes.
-n --no-per-base dont output per-base depth. skipping this output will speed execution
substantially. prefer quantized or thresholded values if possible.
-f --fasta <fasta> fasta file for use with CRAM files [default: ].
--d4 output per-base depth in d4 format.
Other options:
-F --flag <FLAG> exclude reads with any of the bits in FLAG set [default: 1796]
-i --include-flag <FLAG> only include reads with any of the bits in FLAG set. default is unset. [default: 0]
-x --fast-mode dont look at internal cigar operations or correct mate overlaps (recommended for most use-cases).
-q --quantize <segments> write quantized output see docs for description.
-Q --mapq <mapq> mapping quality threshold. reads with a quality less than this value are ignored [default: 0]
-T --thresholds <thresholds> for each interval in --by, write number of bases covered by at
least threshold bases. Specify multiple integer values separated
by ','.
-m --use-median output median of each region (in --by) instead of mean.
-R --read-groups <string> only calculate depth for these comma-separated read groups IDs.
-h --help show help
参数:
-t 线程数,推荐4个,After about 4 threads, there is no benefit for additional threads
-c 指定染色体去计算深度
-b 指定基因组区域,比如基因的位置,就能计算基因的覆盖度
-F 排除含有该FLAG的reads。默认是1796,代表着:
read unmapped (0x4)
not primary alignment (0x100)
read fails platform/vendor quality checks (0x200)
read is PCR or optical duplicate (0x400)
-Q mapping quality,默认是0,一般可以按照自己的需要进行调整为5或者10
-x 快速模式
-T 按照阀值(比如1X,2X,10X)输出达到指定阀值的核苷酸数目
使用
# To calculate the coverage in each exome capture region
mosdepth --by capture.bed sample-output sample.exome.bam
# coverage thresholds
mosdepth --by capture.bed --thresholds 1,3,6,10 sample-output sample.exome.bam
# WGS example For 500-base windows
mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram
indexcov
indexcov是goleft内的一个小工具,goleft is a collection of bioinformatics tools distributed under MIT license in a single static binary。
注意到这个软件,是因为mosdepth和indexcov的作图,可以上github上查看,https://github.com/brentp/goleft/tree/master/indexcov
Quickly estimate coverage from a whole-genome bam or cram index. A bam index has 16KB resolution so that's what this gives, but it provides what appears to be a high-quality coverage estimate in seconds per genome.
The output is scaled to around 1. So a long stretch with values of 1.5 would be a heterozygous duplication. This is useful as a quick QC to get coverage values across the genome.
In our tests, we can estimate depth across 60X genomes for 30 samples in 30 seconds.
测序深度和测序覆盖度,很多时候对于不同的研究有着不同的要求,下面给出网上的一个表格:
Coverage depth recommendations, for human ~3Gb
Category | Detection or Application | Recommended Depth (x) or Reads (millions) | References |
---|---|---|---|
Whole genome sequencing | Homozygous SNVs | 15x | Bentley et al., 2008 |
Heterozygous SNVs | 33x | Bentley et al., 2008 | |
INDELs | 60x | Feng et al., 2014 | |
Genotype calls | 35x | Ajay et al., 2011 | |
CNV | 1-8x | Xie et al., 2009; Medvedev at al., 2010 | |
Whole exome sequencing | Homozygous SNVs | 100x (3x local depth) | Clark et al., 2011; Meynert et al., 2013 |
Heterozygous SNVs | 100x (13x local depth) | Clark et al., 2011; Meynert et al., 2013 | |
INDELs | not recommended | Feng et al., 2014 | |
Transcriptome Sequencing | Differential expression profiling | 10-25M | Liu Y. et al., 2014; ENCODE 2011 RNA-Seq |
Alternative splicing | 50-100M | Liu Y. et al., 2013; ENCODE 2011 RNA-Seq | |
Allele specific expression | 50-100M | Liu Y. et al., 2013; ENCODE 2011 RNA-Seq | |
De novo assembly | >100M | Liu Y. et al., 2013; ENCODE 2011 RNA-Seq | |
DNA Target-Based Sequencing | ChIP-Seq | 10-14M (sharp peaks); 20-40M (broad marks) | Rozowsky et al., 2009; ENCODE 2011 Genome; Landt et al., 2012 |
Hi-C | 100M | Belton, J.M et al., 2012 | |
4C (Circularized Chromosome Confirmation Capture) | 1-5M | van de Weken, H.J.G. et al., 2012 | |
5C (Chromosome Carbon Capture Carbon Copy) | 15-25M | Sanyal A. et al., 2012 | |
ChIA-PET (Chromatin Interaction Analysis by Paired-End Tag Sequencing) | 15-20M | Zhang, J. et al., 2012 | |
FAIRE-Seq | 25-55M | ENCODE 2011 Genome; Landt et al., 2012 | |
DNAse 1-Seq | 25-55M | Landt et al., 2012 | |
DNA Methylation Sequencing | CAP-Seq | >20M | Long, H.K. et al., 2013 |
MeDIP-Seq | 60M | Taiwo, O. et al., 2012 | |
RRBS (Reduced Representation Bisulfite Sequencing) | 10X | ENCODE 2011 Genome | |
Bisulfite-Seq | 5-15X; 30X | Ziller, M.J et al., 2015; Epigenomics Road Map | |
RNA-Target-Based Sequencing | CLIP-Seq | 10-40M | Cho J. et al., 2012; Eom T. et al., 2013; Sugimoto Y. et al., 2012 |
iCLIP | 5-15M | Sugimoto Y. et al., 2012; Rogelj B. et al., 2012 | |
PAR-CLIP | 5-15M | Rogelj B. et al., 2012 | |
RIP-Seq | 5-20M | Lu Z. et al., 2014 | |
Small RNA (microRNA) Sequencing | Differential Expression | ~1-2M | Metpally RPR et al., 2013; Campbell et al., 2015 |
Discovery | ~5-8M | Metpally RPR et al., 2013; Campbell et al., 2015 |
References
Recommended Coverage and Read Depth
—— dulunar 后记于 2020.10