写在前面
基因组组装完成后,我们需要对我们组装好的基因组进行一个质量评估,质量评估主要从连续性(continuity),完整性(completeness),准确性( accuracy)3个方面来进行。连续性可以通过一些基本统计(contig/scaffold N50,total length,gap number等)和LAI指数来验证;完整性可以通过BUSCO和Merqury评估来验证;准确性可以通过Merqury的QV值来验证。
1.基本统计
1.计算contig N50和scaffold N50
1.将组装的基因组从N处打断成contig
#方法一:用一个perl脚本拆开
ln -s ../../HiC/3d-dna/3d-dna1-r0/genome.FINAL.fasta genome.fa #要评估基因组的链接过来
chmod a+x breakGCscaf #为脚本给所有人(a)加一个可执行(x)的权限
breakGCscaf -m 1 ./genome.fa contig.fasta #用breakGCscaf脚本打断
#方法二:用前面purge_dups软件里一个程序(split_fa)拆开
/newlustre/home/jfgui/Wangtao/software/purge_dups/bin/split_fa ./genome.fa > contig.fasta
2.用assembly-stats统计contig/scaffold N50
$ assembly-stats genome.fa contig.fasta
stats for genome.fa
sum = 783116805, n = 110, ave = 7119243.68, largest = 42815409
N50 = 29543814, n = 12
N60 = 25721736, n = 15
N70 = 23824324, n = 18
N80 = 20458000, n = 22
N90 = 18004348, n = 26
N100 = 1000, n = 110
N_count = 283500
Gaps = 567
-------------------------------------------------------------------------------
stats for contig.fasta
sum = 782833305, n = 677, ave = 1156326.89, largest = 18197332
N50 = 6715579, n = 37
N60 = 5193401, n = 50
N70 = 3739176, n = 68
N80 = 2012057, n = 96
N90 = 697200, n = 163
N100 = 184, n = 677
N_count = 0
Gaps = 0
2.计算GC含量及scaffold长度
用seqkit软件的 fx2tab命令来统计
seqkit fx2tab -g -H -l -n -i genome.fa
# -g, --gc print GC content 打印GC含量
# -H, --header-line print header line 打印表头
# -l, --length print sequence length 打印序列长度
# -n, --name only print names (no sequences and qualities) 只打印序列ID那一行而不打印后面基因序列
# -i, --only-id print ID instead of full head 只打印ID
结果:
#id length GC
Chr_1 42815409 38.82
Chr_2 37673591 38.79
Chr_3 36963109 38.57
Chr_4 36096355 38.71
Chr_5 35589000 39.04
Chr_6 30904655 38.80
Chr_7 30903512 38.51
Chr_8 30771000 38.91
Chr_9 29993899 38.95
Chr_10 29948573 38.49
Chr_11 29567000 38.87
Chr_12 29543814 39.14
Chr_13 28239655 38.83
Chr_14 27962040 38.72
Chr_15 25721736 38.40
Chr_16 25470539 39.03
Chr_17 24675873 38.60
Chr_18 23824324 38.79
Chr_19 22225640 39.14
Chr_20 21381000 38.85
Chr_21 20958966 39.14
Chr_22 20458000 39.18
Chr_23 20400676 38.68
Chr_24 20147020 38.91
Chr_25 19768941 38.61
Chr_26 18004348 39.04
Chr_27 17662559 38.97
Chr_28 17226321 39.57
Chr_29 13820029 39.66
Chr_30 12460476 39.87
HiC_scaffold_31 320000 37.39
HiC_scaffold_32 153392 32.55
HiC_scaffold_33 125279 30.75
........
3.计算长短reads的mapping率以及coverage
二代和三代测序reads对比到基因组上的比对率以及覆盖度,比对软件分别用bwa和minimap2,统计软件用samtools的flagstats命令和coverage命令。
1 二代短reads
ln -s ../../HiC/3d-dna/3d-dna1-r0/genome.FINAL.fasta genome.fa #把基因组链接过来
/newlustre/home/jfgui/Wangtao/software/bwa/bwa index -p genome genome.fa #建立基因组比对index,-p为index的前缀
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 24 genome /newlustre/home/jfgui/Wangtao/XX_survey/fastp.out/liver.clean.CLEAN.R1.fastq.gz /newlustre/home/jfgui/Wangtao/XX_survey/fastp.out/liver.clean.CLEAN.R2.fastq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@ 24 - -o ShortReadsMap.bam
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools flagstats ShortReadsMap.bam > genome.short.flagstats
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools coverage ShortReadsMap.bam > genome.short.coverage #计算基因组内每条序列覆盖度
awk 'NR > 1 {sum+=$3; rd +=$4; cov += $5; dp+=$7*$5 }END{print "Total: "sum"\nCovbase: "cov"\ncoverage: "cov/sum"\nmeandepth: "dp/sum}' genome.short.coverage > genome.short.coverage_all #汇总获得全基因组覆盖度
2 三代长reads,同短reads一样
ln -s ../../HiC/3d-dna/3d-dna1-r0/genome.FINAL.fasta genome.fa
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/minimap2 -ax map-pb --secondary=no -t 24 ./genome.fa /newlustre/home/jfgui/Wangtao/XX_assembly/PacBio/P01TYD20308306-1_r64030_20201110_065049_1_A02.subreads.fasta | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@ 24 - -o LongReadsMap.bam
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools flagstats LongReadsMap.bam > genome.long.flagstats
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools coverage LongReadsMap.bam > genome.long.coverage
awk 'NR > 1 {sum+=$3; rd +=$4; cov += $5; dp+=$7*$5 }END{print "Total: "sum"\nCovbase: "cov"\ncoverage: "cov/sum"\nmeandepth: "dp/sum}' genome.long.coverage > genome.long.coverage_all
结果:
$cat genome.short.flagstats
316916360 + 0 in total (QC-passed reads + QC-failed reads)
315055594 + 0 primary
0 + 0 secondary
1860766 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
315264296 + 0 mapped (99.48% : N/A)
313403530 + 0 primary mapped (99.48% : N/A)
315055594 + 0 paired in sequencing
157527797 + 0 read1
157527797 + 0 read2
307109606 + 0 properly paired (97.48% : N/A)
313246052 + 0 with itself and mate mapped
157478 + 0 singletons (0.05% : N/A)
4683530 + 0 with mate mapped to a different chr
3413989 + 0 with mate mapped to a different chr (mapQ>=5)
$cat genome.short.coverage
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
HiC_scaffold_1 1 42815409 18439742 42618605 99.5403 63.6268 35.8 47.7
HiC_scaffold_2 1 37673591 16436610 37520042 99.5924 64.5272 35.8 48.9
HiC_scaffold_3 1 36963109 14658156 36826774 99.6312 58.5624 35.8 55.5
HiC_scaffold_4 1 36096355 14071479 35990500 99.7067 57.6329 35.8 56.4
HiC_scaffold_5 1 35589000 14190480 35416293 99.5147 58.8685 35.8 51.7
HiC_scaffold_6 1 30904655 11973922 30802327 99.6689 57.1446 35.7 55.6
HiC_scaffold_7 1 30903512 11997247 30795968 99.652 57.2322 35.7 56
HiC_scaffold_8 1 30771000 12112140 30670550 99.6736 58.0751 35.7 55.1
HiC_scaffold_9 1 29993899 12265563 29913772 99.7329 60.2536 35.7 53
HiC_scaffold_10 1 29948573 11678423 29833228 99.6149 57.5587 35.8 56.2
HiC_scaffold_11 1 29567000 11740016 29462523 99.6466 58.6067 35.7 54.1
HiC_scaffold_12 1 29543814 14119670 29266134 99.0601 70.4734 35.7 39.2
....
$cat genome.short.coverage_all
Total: 783116805nCovbase: 779981519ncoverage: 0.995996nmeandepth: 59.1533
2.Merqury评估
merqury 软件基于测序数据 kmer 和组装结果 kmer 的一致性进行基因组完整性和consensus quality value(QV) 计算。
安装merquery:
## 下载即可,依赖meryl
git clone https://github.com/marbl/merqury.git
- 根据基因组大小获取合适的K值
$/newlustre/home/jfgui//Wangtao/software/merqury/best_k.sh 783116805 #用merqury软件里的best_k.sh脚本
genome: 783116805
tolerable collision rate: 0.001
19.7545
- 用meryl软件对测序数据的kmer统计
#安装meryl软件
conda install -c bioconda meryl
#kmer统计
meryl k=19 threads=24 count /newlustre/home/jfgui//Wangtao/XX_survey/fastp.out/liver.clean.CLEAN.R1.fastq.gz output XXR1.out.meryl #R1统计
meryl k=19 threads=24 count /newlustre/home/jfgui//Wangtao/XX_survey/fastp.out/liver.clean.CLEAN.R2.fastq.gz output XXR2.out.meryl #R2统计
- 对双末端的kmer合并
meryl k=19 union-sum output XXall.meryl XXR1.out.meryl XXR2.out.meryl
- 运行merquery程序进行评估
ln -s ../../HiC/3d-dna/3d-dna1-r0/genome.FINAL.fasta genome.fa
/newlustre/home/jfgui/Wangtao/software/merqury/merqury.sh XXall.meryl ./genome.fa XXmerqury.out
评估结果:
主要看完整度,qv值
#查看完整度 结果为98.6039%
$cat XXmerqury.out.completeness.stats
genome all 554916632 562773356 98.6039
#查看QV值 结果为42.2611(每一万个碱基里有0.59149个碱基出现错误),Q40为99.99%正确率
$cat XXmerqury.out.qv
genome 883239 782821119 42.2611 5.94148e-05
kmer频数直方图:即将测序reads所有kmer进行分类,特有的kmer,在基因组出现1,2,3.....次的kmer
3.BUSCO评估
从单拷贝直系同源基因角度评估基因组完整性。大概原理是先下载一个研究物种所属大类的单拷贝直系同源基因的数据库,然后与组装好的基因组去对比,看看能找到多少以评估组装质量。
BUSCO软件安装
#建议自动安装,依赖的东西比较多
conda install -c bioconda busco
1.下载busco数据库
因为我研究的物种是鱼,因此下载一个辐鳍鱼纲数据库actinopterygii_odb10
用于比较,可看文献参考用那些数据库。
busco数据库的网址为: https://busco-data.ezlab.org/v5/data/lineages/
mkdir buscoDB
cd buscoDB
wget https://busco-data.ezlab.org/v5/data/lineages/actinopterygii_odb10.2021-02-19.tar.gz #下载
tar -zxvf actinopterygii_odb10.2021-02-19.tar.gz #解压
cd ..
2.运行busco进行评估
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/busco -m geno -i YYgenome.fa -o YY_offline -l ./buscoDB/actinopterygii_odb10 -c 24 --offline
# -m 分析模式,有- geno(基因组组装) - tran(转录组组装) - prot(蛋白序列,注释的基因集) 三种模式
# -i 输入文件
# -o 输出路径
# -l 数据库
# --offline 离线模式,即我们已经下载好了数据库
3.对结果画图
如果有多个评估结果,可以一起整理成图,放在一张图里来看,操作如下
mkdir BUSCO_summaries #建一个目录存放在所有的busco评估结果
cp ./*/short_summary.*.txt BUSCO_summaries/ #将所有评估结果都拷到这个目录下
generate_plot.py -wd BUSCO_summaries/ #最后运行generate_plot.py脚本即可, -wd 即工作路径
查看结果
96.7%的基因完整度,包括95.7%单拷贝基因和1.0%的多拷贝基因。
$cat XX_offline/short_summary.specific.actinopterygii_odb10.XX_offline.txt
# BUSCO version is: 5.4.7
# The lineage dataset is: actinopterygii_odb10 (Creation date: 2021-02-19, number of genomes: 26, number of BUSCOs: 3640)
# Summarized benchmarking in BUSCO notation for file /newlustre/home/jfgui/Wangtao/XX_assembly/QualAssess1/5.BUSCO/XXgenome.fa
# BUSCO was run in mode: euk_genome_met
# Gene predictor used: metaeuk
***** Results: *****
C:96.7%[S:95.7%,D:1.0%],F:0.7%,M:2.6%,n:3640
3517 Complete BUSCOs (C)
3482 Complete and single-copy BUSCOs (S)
35 Complete and duplicated BUSCOs (D)
27 Fragmented BUSCOs (F)
96 Missing BUSCOs (M)
3640 Total BUSCO groups searched
Assembly Statistics:
110 Number of scaffolds
677 Number of contigs
783116805 Total length
0.036% Percent gaps
29 MB Scaffold N50
6 MB Contigs N50
Dependencies and versions:
hmmsearch: 3.1
bbtools: 39.01
metaeuk: 6.a5d39d9
busco: 5.4.7
图片形式展示:
3.LAI评估
LTR组装指数(LTR Assembly Index, LAI)是用完整LTR-RTs转座子在所有LTR-RTs的占比来评估基因组组装连贯性的一个指数,由于LTR在基因组中的比例很高,且由于其重复性组装难度高,所以它的组装质量与基因组组装有密切关系。
软件安装
# gt工具箱(genometools),下载源代码进行编译
wget https://github.com/genometools/genometools/archive/refs/tags/v1.6.5.zip
unzip genometools-1.6.5.zip
cd genometools-1.6.5
mke -j 4
# LTR_FINDER_parallel,同样的方式安装
wget https://github.com/oushujun/LTR_FINDER_parallel/archive/refs/tags/v1.1.gz
unzip LTR_FINDER_parallel-1.1.zip
cd LTR_FINDER_parallel-1.1
make
# LTR_retriever,自动安装LTR_retriever以及依赖
conda install -c bioconda ltr_retriever
- 用gt软件包中的 ltrharvest预测基因组中的LTR
#先用suffixerator建立index
/newlustre/home/jfgui/Wangtao/software/genometools-1.6.5/bin/gt suffixerator -db genome.fa -indexname genome-fa -tis -suf -lcp -des -ssp -sds -dna
#再运行 ltrharvest预测LTR
/newlustre/home/jfgui/Wangtao/software/genometools-1.6.5/bin/gt ltrharvest -index genome-fa -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -mintsd 4 -maxtsd 20 -motif TGCA -motifmis 1 -similar 85 -vic 60 -seed 20 -seqids yes > genome.fa.harvest.scn &>ltrharvest.log
- 用另一个软件LTR_FINDER_parallel同时预测基因组中的LTR
/newlustre/home/jfgui/Wangtao/software/LTR_FINDER_parallel-1.1/LTR_FINDER_parallel -seq genome.fa -threads 24 -harvest_out -size 5000000 -time 1500 -w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85
3.将两个软件预测的结果合并
cat genome.fa.harvest.scn genome.fa.finder.combine.scn >genome.fa.rawLTR.scn
4.用LTR_retriever进行整合,并计算LAI
/newlustre/home/jfgui/Wangtao/software/LTR_retriever-2.9.5/LTR_retriever -genome genome.fa -inharvest genome.fa.rawLTR.scn -threads 24 &>ltr-retriever.log
查看结果
结果不咋地,LAI的值为5.59。根据LAI开发者的文章,把LAI值分成三个类别,Draft级别(0≤LAI<10),Reference级别(10≤LAI<20),Gold级别(20≤LAI),10以上才算好些。
$ head genome.fa.mod.out.LAI
Chr From To Intact Total raw_LAI LAI
whole_genome 1 783116805 0.0050 0.2060 2.41 5.59
1 1 3000000 0.0145 0.4138 3.51 6.69
1 300001 3300000 0.0129 0.3909 3.30 6.48
1 600001 3600000 0.0129 0.3656 3.52 6.70
1 900001 3900000 0.0129 0.3461 3.72 6.90
1 1200001 4200000 0.0126 0.3244 3.88 7.06
1 1500001 4500000 0.0103 0.3119 3.31 6.49
1 1800001 4800000 0.0115 0.3055 3.76 6.94
1 2100001 5100000 0.0089 0.2863 3.10 6.28
其他相关推荐
Merqury评估基因组质量 - 简书 (jianshu.com)
基因组质量评估:(四)用LAI评估基因组组装的连贯性 - 知乎 (zhihu.com)