LTR插入与基因的关系

参考这篇文章中TE与基因的分析流程：Deciphering recent transposition patterns in plants through comparison of 811 genome assemblies

'''Subsequently, the window tool within the BEDtools suite was utilized to determine the gene features in which transposon insertions were located (Quinlan and Hall, 2010). BEDtools intersect was used to find overlaps between SVs and TE annotations with the parameter ‘-wo’ to report their corresponding lengths. TE annotations were matched to structural variants if they intersected by at least 50%; we chose to do so because structural variant callers and transposon annotators have uncertainty regarding exact locations (Riehl et al., 2022). A transposon can be located in multiple genomic regions if the base pair position of the insertion constitutes multiple isoforms (e.g., the promoter of one gene or the intergenic region of another gene). Therefore, only one splice-form annotation was recorded according to the following priority order: whole gene (intersection >90%), exon, intron and regulatory region (2 kb upstream and downstream of genes).'''

分析思路

1 对文件夹下的序列运行HiTE -ltr，计算ltr位置和插入时间
2 注释结果里面，intact_LTR.list有identity和插入时间结果，HiTE.full_length.gff 里面是完整的LTR注释结果。但是两个结果不等同。HITE基于动态调整的结果导致其中有多个bp的位置差异，需要用脚本基于bedtool修改注释，把identity和插入时间附加到HiTE.full_length.gff 后
3 基于修改好的HiTE.full_length.gff 文件，基因组注释文件，可以将HiTE.full_length.gff 和基因组注释文件映射到一起，从而得知不同基因被LTR干扰情况。

分析流程

一文件准备

1 物种的fasta文件，用于HITE提取ltr，例如fasta.fasta

物种的基因组注释文件，比如fasta.pseudo_label.gff

#
python main.py \
  --genome fasta.fasta \
  --out_dir fasta1 \
  --curated_lib TElib.fasta \
  --te_type ltr \
  --thread 40 \
  --plant 1 \
  --annotate 1 \
  --recover 0 \
  --miu 9.4e-09 \
 --annotate 1

2 HiTE运行结果如下，解释一下几个主要的文件

all_TE.fa                  confident_non_ltr_0.fa         genome.rename.fa           HiTE.out
benchmarking.log           confident_non_ltr.fa           HiTE_clean.log             HiTE_split.log
chr_name.map               confident_other.fa             HiTE.tbl
confident_helitron_0.fa    confident_TE.cons.fa           HiTE.full_length.gff       intact_LTR.fa
confident_helitron.fa      confident_tir_0.fa             HiTE.gff                   intact_LTR.fa.classified
confident_ltr_cut.fa       confident_tir.fa               HiTE_lib.log               intact_LTR.list
confident_ltr.internal.fa  filtered_single_copy_LTRs.txt  HiTE.log                   panHiTE.log
confident_ltr.terminal.fa  HiTE_ltr.log

chr_name.map，是染色体名映射文件，部分基因组如果染色体注释太长，软件内部会进行映射。而且intact_LTR.list用的是映射后的染色体号，这导致了不能直接用gff文件

chr_0   chr1
chr_1   chr2
chr_2   chr3

HiTE.full_length.gff 全长LTR注释文件，包含了LTR的坐标，分类等，但不包含插入时间与identity

chr1    HiTE    LTR/Gypsy       8625    9446    .       +       .       id=te_intact_137;name=LTR_44-LTR;classification=LTR/Gypsy
chr1    HiTE    LTR/Gypsy       137939  138854  .       +       .       id=te_intact_1025;name=LTR_71-LTR;classification=LTR/Gypsy

intact_LTR.list 包含注释方法，注释类型，修饰，identity与分化时间（最后一列）

chr_12:14144754..14150825       pass    motif:TGCA      TSD:AAGTT       NA..NA  NA..NA  IN:14145067..14150512   0.9682540000000001      .       NA      NA      LTR/Copia       1247594
chr_12:15050810..15056912       pass    motif:TGTT      TSD:TTTTG       NA..NA  NA..NA  IN:15051136..15056585   0.984756.       NA      NA      LTR/Copia       592348

此时可以进行下一步，将基因组注释文件软链过来，改成gene.gff。HiTE.full_length.gff,intact_LTR.list ,chr_name.map,gene.gff四个文件遍准备好了。

3 运行如下脚本，中间会通过chr_name.map做基于重叠的HiTE.full_length.gff与intact_LTR.list 转换，给gff注释加上插入时间和identity。基于这个转换后的多列注释文件，与gene.gff可以制作出基因与LTR的overlap表格，完成不同identity内各个物种两种LTR插入与基因关系的分析。
第一个脚本，负责把插入时间和identity制作出一个bed文件，方便后续工作

#cat 1.py
import os

map_file = "chr_name.map"
list_file = "intact_LTR.list"
output_bed = "intact_info_for_merge.bed"

# 1. 读取染色体映射
chr_map = {}
with open(map_file, 'r') as f:
    for line in f:
        if line.strip():
            old, new = line.strip().split()
            chr_map[old] = new

# 2. 提取信息并写入 6 列 BED
with open(list_file, 'r') as f, open(output_bed, 'w') as out:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue
        parts = line.strip().split()
        raw_coord = parts[0]
        c_name = raw_coord.split(":")[0]
        start, end = raw_coord.split(":")[1].split("..")

        real_chr = chr_map.get(c_name, c_name)
        identity = parts[7]  # Identity (第八列)
        ins_time = parts[12] # Time (第十三列)

        # 写入 6 列 BED: chr, start, end, name(占位), identity, time
        out.write(f"{real_chr}\t{start}\t{end}\t.\t{identity}\t{ins_time}\n")

print(f"生成的 BED 文件前几行：")
os.system(f"head -n 3 {output_bed}")

第二个bash脚本，负责生成完整的带有插入时间的gff文件

#cat 2.sh
# 1. 空间比对
bedtools intersect -a HiTE.full_length.gff -b intact_info_for_merge.bed -wa -wb -f 0.8 > matched_temp.txt

# 2. 格式化输出：倒数第二列是 Identity，最后一列是 Time
# $1-$9 是原 GFF 列，$14 是 BED 的 Identity, $15 是 BED 的 Time
awk 'BEGIN{FS="\t"; OFS="\t"} {
    print $1, $2, $3, $4, $5, $6, $7, $8, $9, $14, $15
}' matched_temp.txt > HiTE.final_with_id_time.gff

第三个脚本，读取基因坐标和增强型 LTR GFF，计算 Upstream/Downstream/Overlap，并附上Identity和插入时间信息。并且带上基因的位置坐标信息，方便人工排查。

import pandas as pd
import re
import os

def parse_attributes(attr_str):
    """解析 GFF 的第 9 列属性字段"""
    attrs = {}
    if not attr_str: return attrs
    for item in attr_str.split(';'):
        if '=' in item:
            k, v = item.split('=', 1)
            attrs[k.strip()] = v.strip()
    return attrs

def robust_read_gff(file_path, feature_filter=None):
    """
    针对可能缺失换行符的文件进行鲁棒性读取。
    """
    data = []
    if not os.path.exists(file_path):
        return pd.DataFrame()

    with open(file_path, 'r', errors='ignore') as f:
        content = f.read()

    # 如果文件物理上是一行，则按 Chr_ 强制切分
    if '\n' not in content and 'Chr_' in content:
        records = re.split(r'(?=Chr_)', content)
    else:
        records = content.splitlines()

    for rec in records:
        if rec.startswith('#') or not rec.strip():
            continue
        cols = rec.strip().split('\t')
        if len(cols) < 9:
            continue

        if feature_filter and cols[2] != feature_filter:
            continue

        data.append(cols)
    return pd.DataFrame(data)

def main():
    gene_file = 'gene.gff'
    ltr_file = 'HiTE.final_with_id_time.gff'
    output_file = 'LTR_affected_genes_with_coords.tsv'
    window = 2000

    print("正在加载基因数据并提取坐标...")
    raw_genes = robust_read_gff(gene_file, 'gene')
    if raw_genes.empty:
        print("错误：未找到基因数据。")
        return

    genes = []
    for _, row in raw_genes.iterrows():
        attr = parse_attributes(row[8])
        genes.append({
            'chrom': row[0],
            'g_start': int(row[3]),
            'g_end': int(row[4]),
            'g_strand': row[6],
            'gene_id': attr.get('ID', 'NA')
        })
    df_genes = pd.DataFrame(genes)

    print("正在加载 LTR 数据...")
    raw_ltrs = robust_read_gff(ltr_file)
    if raw_ltrs.empty:
        print("错误：未找到 LTR 数据。")
        return

    ltrs = []
    for _, row in raw_ltrs.iterrows():
        attr = parse_attributes(row[8])
        ltrs.append({
            'chrom': row[0],
            'l_start': int(row[3]),
            'l_end': int(row[4]),
            'l_type': attr.get('classification', 'NA'),
            'identity': row.iloc[-2], # 倒数第二列
            'time': row.iloc[-1]      # 最后一列
        })
    df_ltrs = pd.DataFrame(ltrs)

    print("正在计算空间关系并整合坐标...")
    results = []
    for chrom, g_grp in df_genes.groupby('chrom'):
        l_grp = df_ltrs[df_ltrs['chrom'] == chrom]
        if l_grp.empty: continue

        for _, g in g_grp.iterrows():
            # 查找 window 范围内的 LTR
            mask = (l_grp['l_start'] <= g['g_end'] + window) & (l_grp['l_end'] >= g['g_start'] - window)
            overlap_ltrs = l_grp[mask]

            for _, l in overlap_ltrs.iterrows():
                # 位置判定逻辑
                if l['l_start'] <= g['g_end'] and l['l_end'] >= g['g_start']:
                    relation = "Overlap"
                elif g['g_strand'] == '+':
                    relation = "Upstream" if l['l_end'] < g['g_start'] else "Downstream"
                else: # '-' strand
                    relation = "Downstream" if l['l_end'] < g['g_start'] else "Upstream"

                # 整合所有信息
                results.append({
                    'Gene_ID': g['gene_id'],
                    'Chrom': chrom,
                    'Gene_Start': g['g_start'],
                    'Gene_End': g['g_end'],
                    'Strand': g['g_strand'],
                    'Relationship': relation,
                    'LTR_Identity': l['identity'],
                    'Insertion_Time': l['time'],
                    'LTR_Start': l['l_start'],
                    'LTR_End': l['l_end'],
                    'LTR_Type': l['l_type']
                })

    if results:
        final_df = pd.DataFrame(results)
        # 按照染色体和起始位置排序，方便人工阅读
        final_df = final_df.sort_values(['Chrom', 'Gene_Start'])
        final_df.to_csv(output_file, sep='\t', index=False)
        print(f"完成！结果已保存至 {output_file}。")
        print("前 5 行预览：")
        print(final_df.head())
    else:
        print("未发现符合条件的关联。")

if __name__ == "__main__":
    main()

代码水平有限，全靠gemini实现

LTR插入与基因的关系