直接给我惊呆了,从点开开始到出结果一共32秒ಥ_ಥ,想当年光blastp就得跑大半天,更不要说编译MCScanX,各种奇葩问题...
先看下什么是共线性:
Syntenic* = a set of loci in two different species which is located on the same chromosome in each (not necessarily in the same order).
Collinear = a set of loci in two different species which are located on the same chromosome in each, and are conserved in the same order.
Synteny now has the same meaning as collinear, despite the different origins of the terms
Source: Genomics and comparative genomics
之前做共线性分析的时候用的是MCscanX,还有python版本的JCVI,第一步的blastp基本是要了老命了,当时为了加快blastp的速度,把query的pep.fa文件切成几十个批量跑....反正也是折腾了好久。
最近CJ发布了两个插件:OneStepMCScanX-Diamond和Quick Genome Dot Plot,从名字上就能看出来这个是怎么回事了,就是基于Diamond进行序列比对(Diamond比blast快很多倍!而且准确度也相差不是很多)其实相当于把序列比对和比对完的结果用MCScanX生成共线性文件,clt文件,包括gtf文件的合并等融合到一个步骤中了。当然现在的OneStep只是两个基因组间的比较。
具体方法
插件 | 超快!全基因组共线性分析
插件 | 超快!全基因组 Dot Plot
简洁 | 优雅地准备 比较基因组分析 文件
可能遇到的其他问题
去除scaffold
一般在共线性展示的时候,如果你的参考基因组组装到了染色体水平,scaffold或者contig等是要去掉的。如果有一点shell基础的话这个就很简单了
比如
- 首先看下你gff文件的头文件,看看染色体编号形式和scaffold或者contig等的编号形式,这里我们看下JGI的雷蒙德氏棉的gff3
head -30 gene.Grai.JGI.gff3
##gff-version 3
##sequence-region Chr01 6158 55678711
##sequence-region Chr02 75 62758601
##sequence-region Chr03 1 45758218
##sequence-region Chr04 2059 62173553
##sequence-region Chr05 6721 64137711
##sequence-region Chr06 50440 51073468
##sequence-region Chr07 1895 60974818
##sequence-region Chr08 36764 57128522
##sequence-region Chr09 8203 70705631
##sequence-region Chr10 1619 62168479
##sequence-region Chr11 3378 62659606
##sequence-region Chr12 22670 35427486
##sequence-region Chr13 518 58321163
##sequence-region scaffold_102 2100 3605
##sequence-region scaffold_107 2825 16443
##sequence-region scaffold_108 3 13412
##sequence-region scaffold_111 13319 15939
##sequence-region scaffold_112 4529 5790
##sequence-region scaffold_114 53 12661
##sequence-region scaffold_115 3197 6044
##sequence-region scaffold_116 268 1961
##sequence-region scaffold_119 6 12264
##sequence-region scaffold_122 5843 8827
##sequence-region scaffold_127 2607 11788
再看下中棉所亚洲棉的gff3
head -30 gene.Garb.CRI.gff3
##gff-version 3
##sequence-region Chr01 1290 113026203
##sequence-region Chr02 50819 99045064
##sequence-region Chr03 942 135702384
##sequence-region Chr04 55724 98593909
##sequence-region Chr05 32236 97875121
##sequence-region Chr06 5813 132229846
##sequence-region Chr07 21338 97506195
##sequence-region Chr08 8313 129422927
##sequence-region Chr09 10279 85030627
##sequence-region Chr10 64098 129481832
##sequence-region Chr11 5334 124529361
##sequence-region Chr12 6026 103049777
##sequence-region Chr13 15788 123934920
##sequence-region tig00000076 27806 38254
##sequence-region tig00000208 71464 82652
##sequence-region tig00000247 19374 20588
##sequence-region tig00000248 17267 18598
##sequence-region tig00000254 5842 784687
##sequence-region tig00000266 95796 99132
##sequence-region tig00000267 76876 81478
##sequence-region tig00000312 258908 329614
##sequence-region tig00000458 14417 14761
##sequence-region tig00000498 12571 1498618
##sequence-region tig00000545 29227 220599
##sequence-region tig00000736 198220 210945
##sequence-region tig00000744 182460 186937
##sequence-region tig00000762 9152 12184
##sequence-region tig00000807 27129 30536
##sequence-region tig00000820 21055 21872
现在我们发现了再雷棉的gff3中是scaffold,而再亚洲棉的gff3中是tig,那么我们用grep -v(取反)将带有contig
或者scaffold
的行去掉
grep -v "tig" gene.Garb.CRI.gff3 > notigs.Garb.CRI.gff3
grep -v "scaff" gene.Grai.JGI.gff3 > notigs.Grai.CRI.gff3
检查一下:
head -30 notigs.Garb.CRI.gff3
##gff-version 3
##sequence-region Chr01 1290 113026203
##sequence-region Chr02 50819 99045064
##sequence-region Chr03 942 135702384
##sequence-region Chr04 55724 98593909
##sequence-region Chr05 32236 97875121
##sequence-region Chr06 5813 132229846
##sequence-region Chr07 21338 97506195
##sequence-region Chr08 8313 129422927
##sequence-region Chr09 10279 85030627
##sequence-region Chr10 64098 129481832
##sequence-region Chr11 5334 124529361
##sequence-region Chr12 6026 103049777
##sequence-region Chr13 15788 123934920
Chr01 EVM gene 1290 21313 . - . ID=Ga01G0001
###
Chr01 EVM mRNA 1290 21313 . - . ID=Ga01G0001.1
Chr01 EVM CDS 1290 1426 . - 2 ID=cds.Ga01G0001.1;Parent=Ga01G0001.1
Chr01 EVM CDS 21271 21313 . - 0 ID=cds.Ga01G0001.1;Parent=Ga01G0001.1
###
Chr01 EVM gene 26354 29394 . + . ID=Ga01G0002;Name=LPLAT1;description=Lysophospholipid acyltransferase 1
###
Chr01 EVM mRNA 26354 29394 . + . ID=Ga01G0002.1;Name=LPLAT1;description=Lysophospholipid acyltransferase 1
Chr01 EVM CDS 26354 26396 . + 0 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 26960 27184 . + 2 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 27238 27699 . + 2 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 28202 28287 . + 2 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 28442 28505 . + 0 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 28585 28703 . + 2 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01 EVM CDS 28894 29027 . + 0 ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
###################################
head -30 notigs.Grai.CRI.gff3
##gff-version 3
##sequence-region Chr01 6158 55678711
##sequence-region Chr02 75 62758601
##sequence-region Chr03 1 45758218
##sequence-region Chr04 2059 62173553
##sequence-region Chr05 6721 64137711
##sequence-region Chr06 50440 51073468
##sequence-region Chr07 1895 60974818
##sequence-region Chr08 36764 57128522
##sequence-region Chr09 8203 70705631
##sequence-region Chr10 1619 62168479
##sequence-region Chr11 3378 62659606
##sequence-region Chr12 22670 35427486
##sequence-region Chr13 518 58321163
Chr01 phytozomev10 gene 6158 6456 . - . ID=Gorai.001G000100;Name=Trmt61a;description=tRNA (adenine(58)-N(1))-methyltransferase catalytic subunit TRMT61A
Chr01 phytozomev10 mRNA 6158 6456 . - . ID=Gorai.001G000100.1;Parent=Gorai.001G000100;Name=Gorai.001G000100.1;pacid=26824196;longest=1
Chr01 phytozomev10 CDS 6158 6415 . - 0 ID=Gorai.001G000100.1.CDS.1;Parent=Gorai.001G000100.1;pacid=26824196
Chr01 phytozomev10 five_prime_UTR 6416 6456 . - . ID=Gorai.001G000100.1.five_prime_UTR.1;Parent=Gorai.001G000100.1;pacid=26824196
###
Chr01 phytozomev10 gene 6987 7491 . - . ID=Gorai.001G000200
Chr01 phytozomev10 mRNA 6987 7491 . - . ID=Gorai.001G000200.1;Parent=Gorai.001G000200;Name=Gorai.001G000200.1;pacid=26820978;longest=1
Chr01 phytozomev10 three_prime_UTR 6987 7146 . - . ID=Gorai.001G000200.1.three_prime_UTR.1;Parent=Gorai.001G000200.1;pacid=26820978
Chr01 phytozomev10 CDS 7147 7479 . - 0 ID=Gorai.001G000200.1.CDS.1;Parent=Gorai.001G000200.1;pacid=26820978
现在就干净了。
如果不会shell这些操作怎么办呢,别着急,我们有另外一个工具Text Block Extract and filter(再TBtools 搜索栏直接搜索)
按照下图的操作即可
发现染色体排序不符合预期
如果用过MCScanX 你一定会了解到
ctl
文件的作用。我们看下这个文件:
2500
600
Ga-Chr11,Ga-Chr10,Ga-Chr02,Ga-Chr13,Ga-Chr01,Ga-Chr12,Ga-Chr04,Ga-Chr03,Ga-Chr06,Ga-Chr05,Ga-Chr08,Ga-Chr07,Ga-Chr09
Gr-Chr11,Gr-Chr10,Gr-Chr02,Gr-Chr13,Gr-Chr01,Gr-Chr12,Gr-Chr04,Gr-Chr03,Gr-Chr06,Gr-Chr05,Gr-Chr08,Gr-Chr07,Gr-Chr09
看到了吧,罪魁祸首在这里,我们只需要手动的把这个配置文件的染色体顺序改正确就可以了
2500
600
Ga-Chr01,Ga-Chr02,Ga-Chr03,Ga-Chr04,Ga-Chr05,Ga-Chr06,Ga-Chr07,Ga-Chr08,Ga-Chr09,Ga-Chr10,Ga-Chr11,Ga-Chr12,Ga-Chr13
Gr-Chr01,Gr-Chr02,Gr-Chr03,Gr-Chr04,Gr-Chr05,Gr-Chr06,Gr-Chr07,Gr-Chr08,Gr-Chr09,Gr-Chr10,Gr-Chr11,Gr-Chr12,Gr-Chr13
发散下思维 前面也可以不去掉scaffold,到时候直接这里面把没有到染色体水平的那些玩意给删了是不是也可以,我这里没有测试。