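After a genome has been assembled, its quality can be assessed with N50, BUSCO, or the LTR Assembly Index (LAI). This post gives a brief introduction to LAI.
Repetitive sequences in a genome fall broadly into two classes:
Tandem repeats
Dispersed repeats
Tandem repeats include simple sequence repeats, satellite sequences, and so on;
dispersed repeats consist of transposable elements (TEs, also called transposons).
TEs are further divided into two classes:
DNA transposons: move via a DNA intermediate.
RNA transposons (retrotransposons): move via an RNA intermediate, which is reverse-transcribed into DNA and inserted at another position in the genome.
Two main types of RNA transposons are currently recognized:
1 LTR retrotransposons (LTR-RTs): flanked at both ends by long terminal repeats (LTRs).
2 non-LTR TEs: lack terminal repeats. They include LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements).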
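LTR structure
In the figure above, TSD stands for target site duplication and the red triangles mark the LTR motifs. Panel A shows an intact LTR element; a, b, and c are the targets that LTR_retriever analyzes.
LAI is the ratio of the length of intact LTR retrotransposon sequence to the total length of LTR sequence.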
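Written as a formula, the definition above looks roughly like this. The factor of 100 (reporting the ratio as a percentage) is my assumption about how LTR_retriever scales the raw value; it is not stated in the text above. The symbols $L_{\text{intact}}$ and $L_{\text{total}}$ are introduced here only for illustration.

```latex
\[
\text{raw LAI} \;=\; \frac{L_{\text{intact}}}{L_{\text{total}}} \times 100
\]
% L_intact: summed length (bp) of intact LTR retrotransposons
% L_total:  summed length (bp) of all LTR sequence in the assembly
```

As a made-up example, an assembly with 30 Mb of intact LTR-RTs out of 200 Mb of total LTR sequence would have a raw LAI of 30 / 200 × 100 = 15. As far as I know, LTR_retriever reports this as raw_LAI next to an LAI column that is additionally adjusted for LTR sequence identity.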
Installing LTR_retriever
git clone https://github.com/oushujun/LTR_retriever.git
Open the paths file and set the location of each dependency:
##This file will provide LTR_retriever paths to dependent programs.
##You can leave the respective paths empty if programs are accessible through ENV (i.e. exported to .bashrc)
##If you specify a path, please make sure that the required program(s) is directly contained in that path but not in any child directories.
##e.g. BLAST+=/opt/software/BLAST+/2.2.30--GCC-4.4.5/bin/
##LTR_retriever is build based on GenomeTools/1.5.4, BLAST+/2.2.28, BLAST/2.2.26, CDHIT/4.6.1c, HMMER/3.1b2, RepeatMasker/4.0.0 and Tandem Repeats Finder 4.07b
BLAST+=/public/home/fengting/miniconda3/bin/ #a path that contains makeblastdb, blastn, blastx
RepeatMasker=/public/home/fengting/miniconda3/envs/annotation/bin/ #a path that contains RepeatMasker
HMMER=/public/home/fengting/miniconda3/bin/ #a path that contains hmmsearch
CDHIT=/public/home/fengting/demo/cd-hit-v4.8.1-2019-0228/ #a path that contains cd-hit-est (preferred). CDHIT and BLAST are replaceable
BLAST=~/miniconda3/bin/ #a path that contains blastclust (optional)
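Before editing the paths file it can help to see which dependencies are already reachable through PATH, since those entries can be left empty. The following is a minimal sketch, not part of LTR_retriever; `check_deps` is a hypothetical helper name, and the program names are taken from the comments in the paths file.

```shell
# Report which of the given programs cannot be found on PATH.
# Prints one "missing: NAME" line per absent program and
# returns non-zero if at least one program is missing.
check_deps() {
    local missing=0
    local prog
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# Program names as listed in the paths file comments.
check_deps makeblastdb blastn blastx RepeatMasker hmmsearch cd-hit-est \
    || echo "Set the corresponding entries in the paths file."
```

Any program reported as missing needs an explicit directory in the paths file; the others can rely on the environment.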
Installing LTR_finder:
git clone https://github.com/xzhub/LTR_Finder.git
cd LTR_Finder/source/
make
Usage:
### Identify LTR sequences with LTR_finder
/public/home/fengting/demo/lai/LTR_Finder/source/ltr_finder /public/home/fengting/demo/lai/LTR_Finder/source/test/3ds_72.fa >g.scn
### LTR_retriever takes the LTR_finder output, identifies LTR-RTs, and builds a non-redundant LTR-RT library that can be used for genome annotation
/public/home/fengting/demo/lai/LTR_retriever/LTR_retriever -threads 4 -genome test/3ds_72.fa -infinder g.scn
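The two steps above can be wrapped into one function so that the same genome file is passed to both tools. This is only a sketch under stated assumptions: `run_lai_pipeline` and the `.finder.scn` output name are hypothetical, both programs are assumed to be on PATH (the post calls them by absolute path), and only the flags already shown above are used.

```shell
# Run LTR_finder, then LTR_retriever, on a single genome FASTA.
# Usage: run_lai_pipeline GENOME.fa [THREADS]
run_lai_pipeline() {
    local genome="$1"
    local threads="${2:-4}"
    local scn="${genome}.finder.scn"   # hypothetical name for the LTR_finder output

    # Stop early if the input is missing or empty.
    if [ ! -s "$genome" ]; then
        echo "genome FASTA not found or empty: $genome" >&2
        return 1
    fi

    # Step 1: candidate LTR elements, default parameters as in the post.
    ltr_finder "$genome" > "$scn" || return 1

    # Step 2: filter the candidates with LTR_retriever, same flags as in the post.
    LTR_retriever -threads "$threads" -genome "$genome" -infinder "$scn"
}
```

For the test data used above this would be called as `run_lai_pipeline test/3ds_72.fa 4`.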
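Once all dependencies are in place, the run completes.
Figure: output of the completed run
In the result file *.out.LAI, the last value on the second line is the LAI.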
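Reading that value by hand is easy to get wrong on wide tables, so it can be pulled out with awk. A minimal sketch; `get_lai` is a hypothetical helper name, and it relies only on the layout described above (the LAI is the last field of the second line).

```shell
# Print the last field of the second line of a *.out.LAI file.
get_lai() {
    awk 'NR == 2 { print $NF }' "$1"
}

# Example call, using the file name from the plotting step of this post.
if [ -f H7L1.fa.out.LAI ]; then
    get_lai H7L1.fa.out.LAI
fi
```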
LAI assessment criteria
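The criteria can be applied with a small helper. The thresholds below are the ones I recall from the LAI publication (draft: LAI < 10, reference: 10 ≤ LAI < 20, gold: LAI ≥ 20); they are an assumption here and should be checked against the original paper. `classify_lai` is a hypothetical name.

```shell
# Map an LAI value to an assembly quality class.
# Thresholds are assumed (draft < 10 <= reference < 20 <= gold).
classify_lai() {
    awk -v lai="$1" 'BEGIN {
        lai = lai + 0                      # force numeric comparison
        if (lai < 10)      print "draft"
        else if (lai < 20) print "reference"
        else               print "gold"
    }'
}

classify_lai 12.34   # prints "reference"
```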
Visualization:
library(ggplot2)
# Per-window LAI table written by LTR_retriever
mydata1 <- read.table("H7L1.fa.out.LAI", header = TRUE)
pdf("1.pdf")
ggplot(mydata1, aes(x = From / 1000000, y = LAI, group = factor(Chr), colour = Chr)) +
  geom_line() + facet_wrap(~Chr) +  # one panel per sequence
  xlab("Physical distance (Mb)") + ylab("LAI") +
  theme_bw() +  # apply the complete theme first so the tweaks below are not overridden
  theme(legend.position = "none",
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "transparent", colour = NA))
dev.off()