[Genome] Software for identifying centromeres and telomeres

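I have been off form lately. Twice in a row I thought a task only halfway through, so the work was as good as not done, and both times I ended up with the worst outcome I had imagined. Getting the best result means redoing it; time spent on an unsuitable result is time wasted. I was multitasking and lost a whole afternoon today. The busier things get, the more mistakes I make. Life is hard. Genome analysis is highly individual work, not an assembly line: the faster you go, the more problems and the more rework. In the end it is just one project where gene prediction and Hi-C each had to be done twice. Stay strong!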

(1) Telomere detection software (being updated)

0. Telomere database

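https://telomerase.asu.edu/sequences_telomere.html (no longer reachable from here, although it opens from outside the firewall. Plants generally have a 7-base repeat such as AAACCCT, and different chromosomes can carry different telomere repeats, as in the ginseng T2T assembly: https://doi.org/10.1093/hr/uhae107)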
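
A tandem repeat has no fixed starting point, so the same unit may be written in a different phase or on the opposite strand: AAACCCT, CCCTAAA and TTTAGGG all describe one repeat. Below is a minimal Python sketch for normalising a unit to a single spelling before comparing entries. The function names are made up for illustration and do not come from any of the tools listed here.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (upper or lower case)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def canonical_unit(unit: str) -> str:
    """Return the alphabetically smallest spelling among all rotations of the
    unit and of its reverse complement, so equivalent repeats compare equal."""
    unit = unit.upper()
    candidates = []
    for s in (unit, reverse_complement(unit)):
        # Every rotation of s is the same tandem repeat read from another phase.
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


if __name__ == "__main__":
    for u in ("AAACCCT", "CCCTAAA", "TTTAGGG"):
        print(u, "->", canonical_unit(u))
```

Taking the minimum over rotations of both strands is only one possible convention; any rule works as long as it is applied consistently.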

	Software	Download	Year	Requires	Note
1	FindTelomeres	https://github.com/JanaSperschneider/FindTelomeres		Genome alone is sufficient; genome plus gff3 also works (the script can be modified to accept the gff input)	The repeat unit can be changed by editing the script
2	tidk (telomeric-identifier)	https://github.com/tolkit/telomeric-identifier		Genome
3	quarTeT	https://github.com/aaranyue/quarTeT http://www.atcgn.com:8080/quarTeT/home.html	2023	Genome; calls tidk (telomeric-identifier); can fill gaps and identify telomeres and centromeres	https://doi.org/10.1093/hr/uhad127
4	Search with the telomere sequence itself			For example, search for CCCTAAA at the 5′ end and TTTAGGG at the 3′ end, using seqtk
5	VGP	https://github.com/VGP/vgp-assembly			Blueberry T2T, Horticulture Research: https://doi.org/10.1093/hr/uhad209

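In summary
1. The telomere sequences at the two ends of a chromosome should be reverse complements of each other. Watch for this at chromosome ends when adjusting the Hi-C scaffolding: in practice, a very short terminal piece can end up attached in the wrong orientation. Extracting about 50 kb from the head and from the tail and inspecting both helps catch this.
2. Telomere regions are long in some species and short in others, and the assembled length also depends somewhat on assembly quality. Kb-scale or longer is generally regarded as good (I have not seen a formal standard so far).
3. Different chromosomes can carry different telomere repeats.
4. Some species have unusual telomeres.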
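
The end-window check from point 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the plant-type unit CCCTAAA is expected near the 5′ end and its reverse complement TTTAGGG near the 3′ end, the window is 50 kb, and the function names are hypothetical. A chromosome end attached in the wrong orientation would show the opposite pattern.

```python
import sys


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scan_ends(seq, unit="CCCTAAA", window=50_000):
    """Count the unit and its reverse complement in the first and last window."""
    rc = reverse_complement(unit)
    head, tail = seq[:window].upper(), seq[-window:].upper()
    return {
        "head_fwd": head.count(unit),  # expected high at the 5' end
        "head_rc": head.count(rc),
        "tail_fwd": tail.count(unit),
        "tail_rc": tail.count(rc),     # expected high at the 3' end
    }


def orientation_ok(counts):
    """True when both ends show the expected strand of the repeat."""
    return counts["head_fwd"] > counts["head_rc"] and counts["tail_rc"] > counts["tail_fwd"]


if __name__ == "__main__" and len(sys.argv) > 1:
    for name, seq in read_fasta(sys.argv[1]):
        counts = scan_ends(seq)
        print(name, counts, "ok" if orientation_ok(counts) else "check")
```

A "check" result does not prove a misjoin; a chromosome with no assembled telomere at one end gives the same answer, so the Hi-C map still has to be inspected.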

Telomere extension

1. Telomere extensions were conducted using minimap2 (v2.24), medaka consensus (v1.7.2; https://github.com/nanoporetech/medaka) and blastn (v2.11.0+; ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Source: https://www.nature.com/articles/s41597-025-04793-4
2. teloclip: https://github.com/Adamtaranto/teloclip
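
As I understand it, the idea behind this kind of extension is to find reads whose alignment stops at a contig end while part of the read is soft-clipped, meaning the read continues past the assembled sequence. The Python sketch below shows that test on the position and CIGAR string of a SAM record. It is not teloclip's implementation; the function names and the thresholds `max_dist` and `min_clip` are assumptions chosen for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(ops):
    """Number of reference bases consumed by the alignment."""
    return sum(n for n, op in ops if op in "MDN=X")


def overhang(pos, cigar, contig_len, max_dist=50, min_clip=100):
    """Return 'start', 'end' or None for one alignment.

    pos is the 1-based leftmost aligned reference position (SAM column 4).
    max_dist and min_clip are illustrative thresholds, not tool defaults.
    """
    # Hard clips carry no sequence, so ignore them when looking at the ends.
    ops = [o for o in parse_cigar(cigar) if o[1] != "H"]
    if not ops:
        return None
    left_clip = ops[0][0] if ops[0][1] == "S" else 0
    right_clip = ops[-1][0] if ops[-1][1] == "S" else 0
    end = pos + reference_span(ops) - 1
    if pos <= max_dist and left_clip >= min_clip:
        return "start"   # read extends beyond the contig start
    if contig_len - end <= max_dist and right_clip >= min_clip:
        return "end"     # read extends beyond the contig end
    return None
```

Reads flagged this way would still need their clipped part checked for the telomere repeat before being used to extend a contig.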

(2) Centromere detection software (to be updated)

1. Centromics:

(https://github.com/ShuaiNIEgithub/Centromics) requires the genome; no annotation file and no IGV are needed.

2. Telomeres_and_Centromeres

https://github.com/Immortal2333/Telomeres_and_Centromeres requires inspecting the results in IGV together with an annotation file.

3. HiCAT:

https://github.com/865699871/HiCAT # requires a reference centromere region and the genome.

4. quarTeT

(calls tidk (telomeric-identifier); can fill gaps and identify telomeres and centromeres) https://github.com/aaranyue/quarTeT
or http://www.atcgn.com:8080/quarTeT/home.html

5. Tandem Repeats Finder (TRF)

http://tandem.bu.edu/trf/trf.html Papers generally do not seem to describe the exact procedure.

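6. TRASH

TRASH https://github.com/vlothec/TRASH

7. CentIER, published on 8 August 2024 in Plant Communications by Pan Weihua's team at the Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences

Its accuracy metrics are reported to be more than 20% higher than those of comparable software. The source code and test files can be downloaded from GitHub (https://github.com/simon19891216/CentIER/releases/tag/CentIERv2.0). Users who have trouble reaching GitHub can use https://gitee.com/SimonX19891216/CentIER instead.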
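
Since papers rarely spell out how TRF output becomes a centromere candidate (item 5 above), here is one possible way, sketched in Python. It assumes the standard `.dat` file that TRF writes with `-d`: a `Sequence:` header per sequence followed by 15-column rows whose first three columns are start, end and period size. The 50 kb window and the `min_period` filter are my own assumptions, not values from any paper, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_trf_dat(lines):
    """Yield (seq_name, start, end, period, copies) from TRF .dat lines."""
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith("Sequence:"):
            parts = line.split(":", 1)[1].split()
            name = parts[0] if parts else ""
            continue
        fields = line.split()
        # Data rows have 15 columns and begin with integer coordinates.
        if len(fields) == 15 and fields[0].isdigit() and fields[1].isdigit():
            yield name, int(fields[0]), int(fields[1]), int(fields[2]), float(fields[3])


def window_coverage(records, window=50_000, min_period=50):
    """Sum repeat-covered bases per (sequence, window index).

    Overlapping TRF calls are counted more than once; merge them first
    if exact coverage matters.
    """
    cov = defaultdict(int)
    for name, start, end, period, _copies in records:
        if period < min_period:
            continue  # skip short-unit repeats such as microsatellites
        pos = start
        while pos <= end:
            idx = (pos - 1) // window           # TRF coordinates are 1-based
            stop = min(end, (idx + 1) * window)
            cov[(name, idx)] += stop - pos + 1
            pos = stop + 1
    return dict(cov)
```

Windows with the highest coverage on each chromosome are only candidates; they still need to be compared with the other tools and with the Hi-C map.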

(3) Software usage

1. Centromics installation and usage

git clone --recurse-submodules https://github.com/zhangrengang/Centromics
cd Centromics

# install
conda env create -f Centromics.yaml
conda activate RepCent
./install.sh

# start
cd example_data
# long reads
centromics -l hifi.fq.gz -g ref.fa

# long reads + HiC data + ChIP data
centromics -l hifi.fq.gz -g ref.fa -pre hifi -chip chip.bam -hic merged_nodups.hic
centromics -l ont*.fq.gz -g ref.fa -pre ont  -chip chip.bam -hic merged_nodups.hic

# -l takes one long-read file: ccs.fq.gz, ont.fq.gz or ccs.fq
centromics -l ccs.fq.gz/ont.fq.gz/ccs.fq -g genome.fa -pre out -outdir ./ -tmpdir {}.tmp -ncpu 10 -min_ratio 0.03
/share/nas1/yangp/01.software/anaconda3/envs/RepCent/bin/centromics -h 
usage: centromics [-h] [-g FILE] -l FILE [FILE ...] [-hic FILE] [-chip FILE] [-pre STR] [-o DIR]
                  [-tmpdir DIR] [-subsample_x INT] [-subsample_n INT] [-trf_opts STR]
                  [-min_cov FLOAT] [-min_len INT] [-min_monomer_len INT] [-clust_opts STR]
                  [-min_ratio FLOAT] [-window_size INT] [-chr_prefix STR] [-p INT] [-cleanup]
                  [-overwrite] [-v]

Cluster Repeat Sequences.

optional arguments:
  -h, --help            show this help message and exit

Input:
  -g FILE, -genome FILE
                        Genome FASTA file
  -l FILE [FILE ...], -long FILE [FILE ...]
                        Long whole-genome-shotgun reads such as PacBio CCS/CLR or ONT reads in fastq
                        or fasta format [required]
  -hic FILE             Hi-C data alignments by juicer
  -chip FILE            ChIP data alignments in bam format (sorted)

Output:
  -pre STR, -prefix STR
                        Prefix for output [default=centomics]
  -o DIR, -outdir DIR   Output directory [default=cent-output]
  -tmpdir DIR           Temporary directory [default=tmp]

Kmer matrix:
  -subsample_x INT      Subsample long reads up to X depth (prior to `-subsample_n`) [default=5]
  -subsample_n INT      Subsample long reads up to N reads [default=100000]
  -trf_opts STR         TRF options to identify tandem repeats on a read [default='1 1 2 80 5 200
                        2000 -d -h']
  -min_cov FLOAT        Minimum coverage of tandem repeats for a read [default=0.9]
  -min_len INT          Minimum length of tandem repeats for a read [default=100]
  -min_monomer_len INT  Minimum monomer length of a tandem repeat [default=1]
  -clust_opts STR       REPclust options to cluster tandem repeat units [default='-m jaccard -k 15
                        -c 0.2 -x 2 -I 2']
  -min_ratio FLOAT      Minimum relative mass ratio to filter tandem repeats [default=0.1]

Circos:
  Options for circos plot

  -window_size INT      Window size (bp) for circos plot [default=50000]
  -chr_prefix STR       match chromosome to only plot chromosomes [default="chr[\dXYZW]+"]

Other options:
  -p INT, -ncpu INT     Maximum number of processors to use [default=160]
  -cleanup              Remove the temporary directory [default=False]
  -overwrite            Overwrite even if check point files existed [default=False]
  -v, -version          show program's version number and exit

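Output files (use the ONT results when ONT data are available, otherwise the CCS results):
*.candidate_peaks.bed # candidate centromere regions
out.circos_legend.pdf # which tandem repeat (TRF) type each colour in out.circos.png represents
out.circos_legend.txt # meaning of the two rings in out.circos.png
out.circos.pdf # density plot of the different TRF types
out.circos.png
Centromics.txt # counts of each TRF type
out.trf.count # counts of each TRF type per bin
data/genome_karyotype.txt # karyotype file
data/tr_density.txt # input for the Circos plot: counts of each TRF type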

https://github.com/zhangrengang/Centromics/issues/6

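References:

Centromere:

Note: the methods all work on much the same principle. Centromere detection is mostly built on TRF results, and telomere detection looks for the repeat unit, so the tools differ only in detail.
In practice: there is currently no direct definition of completeness for telomeres or centromeres. In my tests the assembled telomere length depends on the amount of data: deeper sequencing is better and tends to give longer assembled telomeres, with the two ends reverse complementary. Centromeres more likely need experimental evidence and literature support, with the results of several tools cross-checked against the Hi-C heatmap.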
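
Cross-checking several tools can be made a little more systematic by keeping only the stretches that enough of them agree on. The Python sketch below does this for one chromosome. The tool names in the usage example and the `min_tools` threshold are placeholders, and each tool is assumed to report a single 0-based, half-open interval.

```python
def supported_regions(calls, min_tools=2):
    """Return maximal segments covered by at least min_tools intervals.

    calls maps a tool name to one (start, end) interval on a chromosome.
    """
    events = []
    for start, end in calls.values():
        events.append((start, 1))   # interval opens
        events.append((end, -1))    # interval closes
    events.sort()  # at equal positions, closings (-1) sort before openings (+1)

    regions, depth, seg_start = [], 0, None
    for pos, delta in events:
        prev = depth
        depth += delta
        if prev < min_tools <= depth:
            seg_start = pos                      # support just reached the threshold
        elif depth < min_tools <= prev:
            if pos > seg_start:
                regions.append((seg_start, pos))  # support just dropped below it
            seg_start = None
    return regions


if __name__ == "__main__":
    calls = {"toolA": (100, 500), "toolB": (300, 700), "toolC": (650, 900)}
    print(supported_regions(calls, min_tools=2))
```

Agreement between tools that all rely on TRF is weaker evidence than it looks, so the Hi-C heatmap and any experimental data remain the real check.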