2019-08-20 [Third-generation sequencing] MatchAnnot: software for evaluating reference-based transcriptome annotation

MatchAnnot: Iso-Seq annotation software

https://github.com/TomSkelly/MatchAnnot

MatchAnnot is a python script which accepts a SAM file of IsoSeq transcripts aligned to a genomic reference and matches them to an annotation database in GTF format.
The aligner used must be splice-aware. MatchAnnot has been developed using the STAR aligner (http://code.google.com/p/rna-star).

Installation

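Download the zip file from GitHub and extract it, for example by running:
unzip MatchAnnot-master.zip
Then go into the folder and run the .py file you need.
matchAnnot.py and clusterView.py can be run directly from the command line, either by prefixing the path or, more simply, by putting them on your PATH.
Make sure the genome and the annotation belong together; they must match.
If the input is a SAM file, it must be sorted.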
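
The sorting requirement can be checked before running. Below is a minimal sketch in plain Python, assuming an uncompressed text SAM file. The function name `is_coordinate_sorted` is made up for this note and is not part of MatchAnnot. It only checks that alignments are grouped by reference sequence and that positions never decrease within a reference.

```python
def is_coordinate_sorted(sam_lines):
    """Return True if alignments are grouped by reference and ordered by position.

    Hypothetical helper, not part of MatchAnnot. Header lines (starting with '@')
    and unmapped records (reference name '*') are ignored.
    """
    seen_refs = set()
    current_ref = None
    last_pos = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, pos = fields[2], int(fields[3])  # SAM columns RNAME and POS
        if ref == "*":
            continue
        if ref != current_ref:
            if ref in seen_refs:
                return False  # this reference already appeared in an earlier block
            seen_refs.add(ref)
            current_ref, last_pos = ref, pos
        elif pos < last_pos:
            return False  # position went backwards within one reference
        else:
            last_pos = pos
    return True


def _record(name, ref, pos):
    """Build a minimal 11-column SAM record for the demo below."""
    return "\t".join([name, "0", ref, str(pos), "255", "4M", "*", "0", "0", "ACGT", "*"]) + "\n"


demo = ["@HD\tVN:1.0\tSO:coordinate\n",
        _record("iso1", "chr1", 100),
        _record("iso2", "chr1", 250),
        _record("iso3", "chr2", 50)]
print(is_coordinate_sorted(demo))
```

For a real file, pass the open file handle instead of the demo list, e.g. `is_coordinate_sorted(open("myData.sam"))`.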

Running

Running it is simple.

Usage: matchAnnot.py [options] <SAM_file> ...

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --gtf=GTF             annotations file, in format specified by --format
  --format=FORMAT       annotations in alternate gtf format (def: standard)
  --clusters=CLUSTERS   cluster_report.csv file name (optional)
  --vars                print variants for each cluster (def: no)
  --outpickle=OUTPICKLE
                        matches in pickle format (optional)

Example command line:
matchAnnot.py --gtf gencode.v19.annotation.gtf myData.sam > annotations.out
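
The same call can be scripted. This is a minimal sketch, assuming matchAnnot.py is executable and on your PATH. The helper names `build_matchannot_command` and `run_matchannot` are hypothetical; only the `--gtf` and `--clusters` options listed in the usage text above are used.

```python
import subprocess


def build_matchannot_command(gtf_path, sam_path, clusters_path=None,
                             script="matchAnnot.py"):
    """Assemble the argument list for matchAnnot.py (hypothetical helper)."""
    cmd = [script, "--gtf", gtf_path]
    if clusters_path is not None:
        cmd += ["--clusters", clusters_path]  # optional cluster_report.csv
    cmd.append(sam_path)                      # the SAM file is the positional argument
    return cmd


def run_matchannot(cmd, out_path):
    """Run the command and write its standard output to out_path,
    which is what the shell redirection '> annotations.out' does."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_matchannot_command("gencode.v19.annotation.gtf", "myData.sam")
print(" ".join(cmd))
# To actually run it: run_matchannot(cmd, "annotations.out")
```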

Input file requirements:

MatchAnnot expects the following inputs:

    --gtf          Annotation file, in format as described by --format option (Mandatory).
    --format       Format of annotation file: 'standard', 'alt' or 'pickle' (default: standard).
    --clusters     cluster_report.csv as produced by IsoSeq (Optional).
    (pipe or arg)  SAM file of IsoSeq transcripts aligned to genomic reference (Mandatory).
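
Since the genome and the annotation must match, one cheap sanity check is to compare sequence names between the two inputs, because a mix of `chr1` and `1` naming is a common mismatch. This is a minimal sketch, assuming a standard tab-separated GTF and a SAM file that carries `@SQ` header lines. The function names are made up for this note. Matching names do not prove that the assembly versions agree, so treat a pass as necessary but not sufficient.

```python
def gtf_seqnames(gtf_lines):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_seqnames(sam_lines):
    """Collect reference names from the SN: tags of @SQ header lines."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header is over, alignments start here
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def names_missing_from_sam(gtf_lines, sam_lines):
    """Sequence names used by the annotation but absent from the SAM header."""
    return sorted(gtf_seqnames(gtf_lines) - sam_seqnames(sam_lines))


gtf_demo = ["##description: demo\n",
            "chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";\n",
            "chrM\tHAVANA\tgene\t10\t90\t.\t+\t.\tgene_id \"G2\";\n"]
sam_demo = ["@HD\tVN:1.0\n", "@SQ\tSN:chr1\tLN:1000\n", "@SQ\tSN:MT\tLN:500\n"]
print(names_missing_from_sam(gtf_demo, sam_demo))
```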

The output format is as follows:

The output of matchAnnot.py contains several types of line:

isoform:     A mapped isoform, output of IsoSeq. Line shows isoform name,
             and start and end genomic coordinates of alignment.

cigar:       The cigar string from the SAM file entry for the isoform (read straight from the SAM record).

cl:          *A list of the reads-of-insert which were clustered to create the
             isoform*. This information is printed only if a cluster report file
             is supplied via the --clusters parameter. Each line lists one or
             more reads from a single SMRTcell, labelled as either full-length
             or non-FL. The mapping from SMRTcell number to full SMRTcell name
             is in the summary at the end of the output.

polyA:       A list of the positions where polyadenylation motifs were found
             near the 3' end of the isoform. (Useful for tallying where polyA
             signals occur, which motifs are used, and so on.)

gene:        A gene in the annotation file whose position overlaps the
             aligned isoform. Line shows gene name, its start and end
             coordinates, and the differences between those and the
             isoform start and end.
***
tr:          An annotated transcript of the gene under consideration. Line
             shows transcript name, a score, and the exon-to-exon
             mapping. Each [] grouping in the exon mapping is
             a list of transcript exons which match the isoform exons
             (see example below). Scores are as follows:

             5: IsoSeq exons match annotation exons one-for-one. Sizes agree
                except for leading and trailing edges.

             4: Like 5, but leading and trailing edge sizes differ by a 
                larger amount than the score-5 transcript found for this gene.

             3: One-for-one exon match, but sizes of internal exons disagree.

             2: The best match among all score=1 transcripts.

             1: Some exons overlap, overlaps are 1-for-1 where they exist.

             0: Everything else: isoform overlaps gene, but little or
                no exon congruence.

exon:        Details of a single exon match. Shown only for transcripts
             with score >= 3. Line shows isoform and transcript start and
             stop coordinates and the delta between them, plus the
             number of indels found in the alignment (per the cigar
             string).

result:      A one-line summary for the isoform, showing the best gene and
             transcript found, and the resulting score.

summary:     Bookkeeping information at the end.
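
The exon line reports indels "per the cigar string". As an illustration of what that involves, here is a minimal sketch that counts insertion (I) and deletion (D) operations in a CIGAR string. The function name `count_indels` is made up for this note, and whether MatchAnnot counts indel events or indel bases is not stated in the text above, so this returns both. Note that `N` (a skipped region, i.e. an intron in a spliced alignment) is not a deletion.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar):
    """Return {'I': (events, bases), 'D': (events, bases)} for a CIGAR string."""
    events = {"I": 0, "D": 0}
    bases = {"I": 0, "D": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op in events:
            events[op] += 1
            bases[op] += int(length)
    return {op: (events[op], bases[op]) for op in ("I", "D")}


# Two exons separated by a 1000-base intron (N), with one 2-base insertion
# and one 1-base deletion inside the first exon.
print(count_indels("10M2I5M1D20M1000N30M"))
```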


An example of an exon mapping (exons are numbered from 0):

                       1           2             3          4             5
   isoform:        ==========    ======    ==============  ===         =======
   transcript:     =====        =========    ====    =========  =====    ========
                     1              2         3          4        5         6

   maps as follows:

   [1] [2] [3,4] [4] [6]


   An ideal mapping is one-for-one:

   [1] [2] [3] [4] [5]


   To make it *really* ideal, the exon coordinates should be equal as well (or nearly so).
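
The mapping in this example can be reproduced with a simple interval-overlap test. This is a minimal sketch, not MatchAnnot's own code: the function names are hypothetical, the coordinates are just the character columns of the diagram above, and exons are numbered from 1 as in the diagram. `is_one_for_one` encodes one plausible reading of "one-for-one" (each isoform exon hits exactly one transcript exon, and no transcript exon is hit twice).

```python
def map_exons(isoform_exons, transcript_exons):
    """For each isoform exon, list the 1-based indices of the transcript
    exons it overlaps. Exons are (start, end) pairs with inclusive ends."""
    mapping = []
    for iso_start, iso_end in isoform_exons:
        hits = [idx
                for idx, (tr_start, tr_end) in enumerate(transcript_exons, start=1)
                if tr_start <= iso_end and iso_start <= tr_end]
        mapping.append(hits)
    return mapping


def format_mapping(mapping):
    """Render a mapping in the bracketed style used in the example."""
    return " ".join("[" + ",".join(str(i) for i in hits) + "]" for hits in mapping)


def is_one_for_one(mapping):
    """True if every isoform exon maps to exactly one transcript exon
    and no transcript exon is used more than once."""
    flat = [i for hits in mapping for i in hits]
    return all(len(hits) == 1 for hits in mapping) and len(set(flat)) == len(flat)


# Column positions read off the diagram (shifted so the first exon starts at 1).
isoform = [(1, 10), (15, 20), (25, 38), (41, 43), (53, 59)]
transcript = [(1, 5), (14, 22), (27, 30), (35, 43), (46, 50), (55, 62)]

mapping = map_exons(isoform, transcript)
print(format_mapping(mapping))   # [1] [2] [3,4] [4] [6]
print(is_one_for_one(mapping))   # False: isoform exon 3 spans two transcript exons
```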

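The score levels look like the most important part of the annotation, so here is my reading of them. A gripe about one company's translation, which rendered score 1 as:
"The transcript corresponds one-to-one with the exons of the annotated known transcript, but only some of the exons overlap."
I could never work out what that meant. Only after reading the original text did it become clear: the translation had it more or less backwards.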

https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output

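tr:          An annotated transcript of the gene under consideration. Line shows transcript name, a score, and the exon-to-exon mapping. Each [] grouping in the exon mapping is a list of transcript exons which match the isoform exons (see example below). Scores are as follows:

             5: IsoSeq exons match annotation exons one-for-one. Sizes agree except for leading and trailing edges.
                      My reading: the exons of the PB (PacBio) transcript correspond exactly one-to-one with those of the annotated known transcript; they differ only at the outer ends of the first and last exons.
             4: Like 5, but leading and trailing edge sizes differ by a larger amount than the score-5 transcript found for this gene.
                      My reading: the exons of the PB transcript correspond exactly one-to-one with those of the annotated known transcript, as for score 5, but the differences at the start and end of the transcript are larger.
             3: One-for-one exon match, but sizes of internal exons disagree.
                      My reading: the exons correspond one-to-one with those of the annotated known transcript (same structure?), but the sizes of internal exons differ.
             2: The best match among all score=1 transcripts.
                      My reading: the best-matching transcript among all score=1 transcripts? (Rather puzzling.)
             1: Some exons overlap, overlaps are 1-for-1 where they exist.
                      My reading: the PB transcript matches only some of the exons, but where it does match, the exons correspond one-to-one with those of the known transcript.
             0: Everything else: isoform overlaps gene, but little or no exon congruence.
                      My reading: the transcript lies within the gene's interval, but has essentially no exon overlap with the known transcripts.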


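None of these descriptions says that a given score strictly requires a 1-to-1 exon correspondence, so judge from the actual results.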

References
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Run-matchAnnot
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output
https://github.com/TomSkelly/MatchAnnot/wiki
