LTR_retriever | A tool for integrating LTR analyses

Transposable elements (TEs) are ubiquitous interspersed repeats in most sequenced eukaryote genomes (Wessler, 2006).

According to their transposition schemes, TEs are categorized into two classes.

Class I TEs (retrotransposons) use RNA intermediates with a copy-and-paste transposition mechanism (Kumar and Bennetzen, 1999; Wicker et al., 2007).

Class II TEs (DNA transposons) use DNA intermediates with a cut-and-paste mechanism (Feschotte and Pritham, 2007; Wicker et al., 2007).

Depending on the presence of long terminal repeats (LTRs), class I TEs are further classified as LTR retrotransposons (LTR-RTs) and non-LTR-RTs, including short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs) (Han, 2010).

Although the structure of the LTR-RT is conserved among species, their nucleotide sequences are not conserved except among closely related species. Particularly, substantial sequence diversity is observed within the LTR region. Therefore, LTR-RTs are usually not adequately identified based on sequence homology.

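I. Background

1. Structure of LTR-RTs

In plant genomes, class I transposable elements, specifically LTR-RTs (LTR retrotransposons), are the main cause of genome expansion. A complete LTR is between 85 and 5,000 bp long. Panel A of the figure below shows an intact LTR-RT: gray boxes mark the TSDs (target site duplications), red triangles mark the LTR motifs (about 2 bp long), and blue boxes mark the LTRs. The internal sequence between the two LTRs ranges from 1,000 to 15,000 bp.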

Figure 1. The structure of LTR-RTs, their derivatives, and false positives.

A, The structure of an intact LTR-RT with
(1) LTR (navy pentagons),
(2) a pair of dinucleotide palindromic motifs flanking each LTR (magenta triangles) (in plants, LTRs are typically flanked by 2-bp palindromic motifs, commonly 5'-TG...CA-3', with some rare exceptions),
(3) the internal region including protein-coding sequences for gag, pol, and env (green boxes), and
(4) a 5-bp target site duplication (TSD) flanking the element (gray boxes).

B, A truncated LTR-RT with missing structural components.

C, A solo LTR.

D, A nested LTR-RT with another LTR-RT inserted into its coding region.

E, A false LTR-RT detected due to two adjacent non-LTRs (gray boxes). The counterfeit also features a direct repeat (blue pentagons) but usually has extended sequence similarity on one or both sides of the LTR (orange and brown boxes). Regions a to d are extracted and analyzed by LTR_retriever.

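Intact LTR-RTs fall mainly into two superfamilies: Gypsy and Copia. If the internal sequence between the LTRs contains no open reading frame (ORF), the LTR-RT cannot transpose on its own.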

Given these mutation mechanisms, intact elements only contribute a small fraction of all LTR-RT-related sequences in a genome. If the required structural components are altered (i.e., mutated, truncated, or interrupted by nested insertions of other TEs; Fig. 1), the LTR element becomes nonautonomous and is difficult to identify using structural information.

2. LTR_retriever

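LTR_retriever is a command-line program (written in Perl) that accurately identifies LTR retrotransposons (LTR-RTs) from the outputs of LTRharvest, LTR_FINDER, MGEScan 3.0.0, LTR_STRUC, and LtrDetector, and generates a non-redundant LTR-RT library for genome annotation.

By default, the program generates a whole-genome LTR-RT annotation and the LTR Assembly Index (LAI), which is used to evaluate the assembly continuity of the input genome. Users can also run LAI separately (see Usage in the LTR_retriever documentation).

We introduce LTR_retriever, a novel tool for the identification of LTR-RTs. This package efficiently removes false positives from initial software predictions. It is possible to reduce false positives by defining more stringent parameters, such as high LTR similarity, intermediate LTR length, and TGCA motif.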

Identification of LTR-RTs with Noncanonical Motifs

LTR-RT features dinucleotide motifs flanking the direct repeat regions (Fig. 1). The most common motif is the palindromic 5'-TG...CA-3' motif. However, during manual curation of LTR-RTs, we discovered many LTRs with non-TGCA motifs (A.A. Ferguson and N. Jiang, unpublished data). These noncanonical motifs can be nonpalindromic: for example, Tos17, a rice LTR-RT that can be activated by tissue culture, has noncanonical motifs of 5'-TG...GA-3' (Hirochika et al., 1996); AtRE1 in Arabidopsis has 5'-TA...TA-3' motifs (Kuwahara et al., 2000); and TARE1, intensively amplified in the tomato (Solanum lycopersicum) genome, has 5'-TA...CA-3' motifs (Yin et al., 2013). In addition, three copies of Gypsy-like elements with 5'-TG...CT-3' motifs were annotated in

Noncanonical Copia Elements Prefer Nonrepetitive Genomic Regions and Are Often Inserted within or Close to Genes

Previous studies indicate that Gypsy and Copia elements are differentially located in plant genomes. The distribution of $\color{red}{Copia}$ elements is biased toward euchromatic chromosomal arms that are relatively close to genes, whereas $\color{red}{Gypsy}$ elements are more likely located in the gene-poor, heterochromatic or pericentromeric regions (Baucom et al., 2009; Bousios et al., 2012). Here, we demonstrate that the noncanonical Copia elements are even closer to genes than canonical Copia elements and insert preferentially into nonrepetitive sequences (Fig. 5). Apparently, there is a negative correlation between the distance to genes and element size, particularly the size of LTRs. As a result, the limited amplification and smaller size are likely the consequences of the target specificity of noncanonical LTR elements.

In summary, we developed a package that takes genome sequences or corrected PacBio reads as input and generates high-quality, nonredundant libraries for LTR elements. It also provides information about the insertion time and location of intact LTR elements in the genome. This tool demonstrates significant improvements in specificity, accuracy, and precision while maintaining high sensitivity compared with existing methods. As a result, it will facilitate future genome assembly and annotation as well as enable rapid comparative studies of LTR-RT dynamics in multiple genomes.

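II. Installation

LTR_retriever is not a stand-alone tool. Its main role is to integrate the results of LTRharvest, LTR_FINDER, MGEScan 3.0.0, LTR_STRUC, and LtrDetector, filter out the false-positive LTR-RTs among them, and produce a high-quality LTR-RT library.

First, download LTR_retriever itself:

git clone https://github.com/oushujun/LTR_retriever.git

Then edit the paths file in the LTR_retriever directory to give the locations of BLAST+, RepeatMasker, HMMER, and CDHIT: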

BLAST+=/your_path_to/BLAST+2.2.30/bin/
RepeatMasker=/your_path_to/RepeatMasker4.0.0/
HMMER=/your_path_to/HMMER3.1b2/bin/
CDHIT=/your_path_to/CDHIT4.6.1/
BLAST=/your_path_to/BLAST2.2.26/bin/ #not required if CDHIT provided
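Before running anything, it can help to confirm that the programs behind these entries are reachable. The following is a minimal sketch, assuming the tools are on PATH and go by their usual executable names (blastn and makeblastdb for BLAST+, hmmsearch for HMMER, cd-hit-est for CDHIT). `check_tools` is a hypothetical helper written for this post, not part of LTR_retriever.

```shell
# Report which of the given executables can be found on PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Executable names below are assumptions about a typical installation.
check_tools blastn makeblastdb RepeatMasker hmmsearch cd-hit-est \
  || echo "Add the missing tools to PATH or to the paths file."
```

If a tool is installed but not on PATH, pointing the corresponding entry of the paths file at its directory should be enough.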

A more convenient route is to install cd-hit and RepeatMasker with Bioconda and then download LTR_retriever:

conda create -n LTR_retriever
source activate LTR_retriever
conda install -c conda-forge perl perl-text-soundex
conda install -c bioconda cd-hit
conda install -c bioconda/label/cf201901 repeatmasker
git clone https://github.com/oushujun/LTR_retriever.git
./LTR_retriever/LTR_retriever -h

You also need to install LTRharvest, LTR_FINDER, and MGEScan_LTR separately.

LTRharvest: http://genometools.org/
LTR_FINDER: https://github.com/xzhub/LTR_Finder
Modified MGEScan_LTR: http://dawgpaws.sourceforge.net/

Because MGEScan_LTR turned out to be more troublesome to install than I expected, this post uses only LTRharvest and LTR_FINDER.

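III. Usage

Although LTR_retriever accepts input from several LTR tools, the results from LTRharvest and LTR_FINDER alone are already quite good.

Taking the Arabidopsis thaliana genome sequence as an example, run LTRharvest and LTR_FINDER separately to find candidate LTR sequences, then merge the results with LTR_retriever.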

#LTRharvest
gt suffixerator \
  -db TAIR10.fa \
  -indexname TAIR10 \
  -tis -suf -lcp -des -ssp -sds -dna

gt ltrharvest \
  -index TAIR10 \
  -similar 90 -vic 10 -seed 20 -seqids yes \
  -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 \
  -motif TGCA -motifmis 1  > TAIR10.harvest.scn &


#LTR_FINDER
LTR_FINDER_parallel \
  -seq TAIR10.fa \
  -threads 10 -harvest_out \
  -size 1000000 -time 300

LTR_retriever accepts a single set of candidate LTRs:

LTR_retriever -genome TAIR10.fa -inharvest TAIR10.harvest.scn

It also accepts multiple candidate LTR inputs:

# -inharvest: output of LTRharvest
# -infinder:  output of LTR_FINDER
LTR_retriever \
  -genome TAIR10.fa \
  -inharvest TAIR10.harvest.scn \
  -infinder TAIR10.finder.scn \
  -threads 20
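One caveat: the LTR_FINDER_parallel run above used -harvest_out, which, as far as I know, writes its results in LTRharvest format, not in LTR_FINDER's native format. If that holds for your version, one option is to concatenate both candidate files and pass the merged file to -inharvest. The sketch below assumes this. `merge_candidates` is a hypothetical helper, and the file names in the example, including the `.finder.combine.scn` suffix, are assumptions to check against what your run actually wrote.

```shell
# Concatenate several LTRharvest-format candidate files into one.
# Usage: merge_candidates OUTPUT INPUT [INPUT ...]
merge_candidates() {
  out=$1
  shift
  cat "$@" > "$out"
}

# Example (file names are assumptions; check the files your run produced):
# merge_candidates TAIR10.rawLTR.scn TAIR10.harvest.scn TAIR10.fa.finder.combine.scn
# LTR_retriever -genome TAIR10.fa -inharvest TAIR10.rawLTR.scn -threads 20
```

Make sure the background LTRharvest job has finished before merging, for example by calling `wait` in the same shell.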

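The output files are listed below. The library is written as $REFERENCE.LTRlib.fa. Rename it to LTR.lib and use it as the input library when masking repeats with RepeatMasker later.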
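The masking step could look roughly like the sketch below. It uses standard RepeatMasker options (-lib, -pa, -gff, -nolow, -dir). The function name `mask_with_ltr_lib`, the output directory name, and the thread count are illustrative assumptions, not settings taken from LTR_retriever.

```shell
# Mask a genome with the LTR-RT library produced by LTR_retriever.
# Usage: mask_with_ltr_lib GENOME LIBRARY THREADS
mask_with_ltr_lib() {
  genome=$1
  lib=$2
  threads=$3
  outdir="${genome}.masked_out"   # hypothetical output directory name
  mkdir -p "$outdir"
  # -lib: custom repeat library; -pa: parallel jobs; -gff: also write GFF;
  # -nolow: skip low-complexity masking; -dir: where to write results
  RepeatMasker -lib "$lib" -pa "$threads" -gff -nolow -dir "$outdir" "$genome"
}

# Example:
# cp TAIR10.fa.LTRlib.fa LTR.lib
# mask_with_ltr_lib TAIR10.fa LTR.lib 10
```

Copying the library leaves the original output in place, which is convenient if LTR_retriever needs to be rerun.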


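The output of LTR_retriever includes:

1. Intact LTR-RTs with coordinate and structural information

Summary table (.pass.list)
GFF3 format output (.pass.list.gff3)

2. LTR-RT libraries

All non-redundant LTR-RTs (.LTRlib.fa)
All non-TGCA LTR-RTs (.nmtf.LTRlib.fa)
All LTR-RTs with redundancy (.LTRlib.redundant.fa)

3. Whole-genome LTR-RT annotation by the non-redundant library

GFF format output (.out.gff)
LTR family summary (.out.fam.size.list)
LTR superfamily summary (.out.superfam.size.list)
LTR distribution on each chromosome (.out.LTR.distribution.txt)

4. LTR Assembly Index (.out.LAI)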
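A quick way to inspect the summary table is to count its entries and tally one of its columns. The sketch below assumes that .pass.list is tab-separated and that header lines begin with `#`. Both helper names are hypothetical, and the column number in the example is an assumption, so check the header line of your own file first.

```shell
# Count intact LTR-RTs: every non-comment, non-empty line of a .pass.list file.
count_intact() {
  awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Tally the values found in one tab-separated column (e.g. the superfamily).
# Usage: tally_column FILE COLUMN_NUMBER
tally_column() {
  awk -F '\t' -v col="$2" '!/^#/ && NF >= col { count[$col]++ }
    END { for (k in count) print k "\t" count[k] }' "$1" | sort
}

# Example (column 10 as the superfamily is an assumption; check the header):
# count_intact TAIR10.fa.pass.list
# tally_column TAIR10.fa.pass.list 10
```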

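Other tests

LAI is a metric the authors proposed for measuring how complete a genome assembly is. Comparing the LAI obtained from two LTR inputs with that from one LTR input, the latter is 15.62 and the former is 14.47. This means the value is in fact affected by the number of candidate LTRs supplied, although the final result should stay within a stable range.

I tested several species, recording the LTRs found by each of the two programs and the LTRs that finally passed the filter. Relatively few candidates pass in the end. The rate-limiting steps are LTR_finder and LTRharvest.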
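The dependence on the input is easier to see from the definition. As I understand the LAI paper and the LTR_retriever documentation (treat the constants as an assumption and check them against the source), the index is the share of intact LTR-RT sequence among all LTR-RT sequence, adjusted by the whole-genome LTR identity:

```latex
\begin{aligned}
\text{raw LAI} &= \frac{\text{total length of intact LTR-RTs}}{\text{total length of all LTR-RT sequence}} \times 100 \\
\text{LAI} &= \text{raw LAI} + 2.8138 \times (94 - \text{whole-genome LTR identity})
\end{aligned}
```

Both lengths depend on which candidates were supplied and passed the filter, so a different set of inputs can shift the result.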

Species	Genome size	LTR_finder	LTRharvest	Pass	LAI	Sequencing technology
A. lyrata	206 M	1456	1017	1044	20.39	Sanger
A. thaliana (TAIR10)	120 M	207	550	184	15.62	Sanger
B. rapa (2.5)	391 M	1251	3182	520	0	PacBio + second-generation 20 Kb and 40 Kb libraries
B. rapa (3.0)	353 M	3515	3635	1968	7.16	PacBio + BioNano + Hi-C
C. rubella	135 M	643	600	144	10.96	454 + Sanger
A. alpina	336 M	3840	3107	2556	11.01	PacBio + BioNano + Hi-C
Species A (unnamed)	454 M	5384	2789	4294	17.89	PacBio

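One more interesting observation: although B. rapa version 3.0 is a recent genome assembled with third-generation sequencing plus Hi-C, by the LAI standard it only counts as Draft level, though it is of course considerably better than version 2.5.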
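The quality levels can be sketched as a small classifier. The thresholds here (Draft below 10, Reference from 10 up to 20, Gold at 20 and above) are my reading of the LAI paper and should be treated as an assumption. `lai_class` is a hypothetical helper.

```shell
# Classify an LAI value. Thresholds (draft < 10, reference 10 to < 20,
# gold >= 20) are an assumption based on the LAI paper.
lai_class() {
  awk -v lai="$1" 'BEGIN {
    if (lai < 10)      print "Draft"
    else if (lai < 20) print "Reference"
    else               print "Gold"
  }'
}

# Three values from the table above: B. rapa 3.0, TAIR10, A. lyrata.
for v in 7.16 15.62 20.39; do
  printf '%s\t%s\n' "$v" "$(lai_class "$v")"
done
```

Under these thresholds the loop labels 7.16 as Draft, 15.62 as Reference, and 20.39 as Gold.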

The authors also compared multiple assembly versions for many species. The figure below is from "Assessing genome assembly quality using the LTR Assembly Index (LAI)".