【技能】使用Python提取基因结构bed文件

为了方便提取基因结构,写了一个简单的Python脚本,支持提取:
  • exon
  • intron
  • UTR
  • CDS
  • start codon
  • end codon
其中关于起始终止密码子的提取可能还存在bug,期待大家一起来debug。

此外,代码的风格也不是“熟练老手”,仍有很大进步空间,没有做到“优雅”,期待在新版本能有所进步。

#!/usr/bin/python3

import argparse
import textwrap

parser = argparse.ArgumentParser(prog = "getFeature.py",
                 description = textwrap.dedent('''\
                         This is a python tool for you to extract gene features in bed file format from bed13 file.

                         About how to convert GTF annotation file into bed13 file:
                         - https://www.jianshu.com/p/de2455a8f507
                         About the structure of transcripts:
                         - https://www.jianshu.com/p/f97935ce7255
                         '''),
                 formatter_class = argparse.RawDescriptionHelpFormatter,
                 epilog = "Owner: skm@smail.nju.edu.cn, Nanjing University."
                 )

parser.add_argument('--bed13', '-b', required = True, help = "the bed13 file.")
parser.add_argument('--prefix', '-p', required = True, help = "the prefix of the file to store the result.")
parser.add_argument('--feature', '-f', required = True, choices = ['exon', 'intron', '5utr', '3utr', 'start_codon', 'stop_codon', 'cds'], help = "the feature type you want to extract.")
parser.add_argument('--keep', '-k', action = 'store_true', help = "whether to keep non-main chr, default is not.") 
parser.add_argument('--version', '-v', action = 'version', version='%(prog)s 1.0.0')

args = parser.parse_args()

#input
bed = args.bed13
prefix = args.prefix
feature = args.feature


#load bed13 file
with open(bed, 'r') as input_file:
    tx_dict = {}
    chrs_keep = ['chrM', 'chrX', 'chrY']
    for i in range(22):
        chrs_keep.append('chr' + str(i+1))
    for line in input_file.readlines():
        line = line.strip()
        fields = line.split("\t")
#       chr = fields[0]
#       start = fields[1]
#       end = fields[2]
#       name = fields[3]
#       strand = fields[5]
#       cds_start = fields[6]
#       cds_end = fields[7]
#       num = fields[9]
#       exon_size = fields[10]
#       exon_start = fields[11]

        name = fields.pop(3)
        del fields[3]
        del fields[6]
        if args.keep:
            tx_dict[name] = fields
        else:
            if fields[0] in chrs_keep:
                tx_dict[name] = fields
#       chr, start, end, strand, cds_start, cds_end, num, exon_size, exon_start
#function to extract feature
#exon
def get_exon(dict, dest):
    file = dest + '.exon.bed'
    with open(file, 'w') as exon:
        for tx,feat in dict.items():
            for i in range(int(feat[6])):
                curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + "_" + str(i+1), '0', feat[3]]
                exon.write("\t".join(curr_line) + "\n")
#intron
def get_intron(dict, dest):
    file = dest + '.intron.bed'
    with open(file, 'w') as intron:
        for tx,feat in dict.items():
            if int(feat[6]) != 1:
                for i in range(int(feat[6]) - 1):
                    curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                    curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                    next_exon_start = int(feat[1]) + int(feat[8].split(",")[i+1])
                    curr_line = [feat[0], str(curr_exon_end), str(next_exon_start), tx + "_" + str(i+1), '0', feat[3]]
                    intron.write("\t".join(curr_line) + "\n")


#cds
def get_cds(dict, dest):
    file = dest + '.cds.bed'
    with open (file, 'w') as cds:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                index = 0
                for i in range(int(feat[6])):
                    curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                    curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                    bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                    bool_2 = curr_exon_start >= int(feat[4]) and curr_exon_end <= int(feat[5])
                    bool_3 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                    bool_4 = curr_exon_start <= int(feat[4]) and curr_exon_end >= int(feat[5])
                    if bool_1:
                        index += 1
                        curr_line = [feat[0], str(feat[4]), str(curr_exon_end), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_2:
                        index += 1
                        curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_3:
                        index += 1
                        curr_line = [feat[0], str(curr_exon_start), str(feat[5]), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_4:
                        index += 1
                        curr_line = [feat[0], str(feat[4]), str(feat[5]), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")

#start_codon
def get_start_codon(dict, dest):
    file = dest + '.startcodon.bed'
    with open(file, 'w') as startcodon:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    curr_line = [feat[0], feat[4], str(int(feat[4]) + 3), tx, '0', feat[3]]
                else:
                    curr_line = [feat[0], str(int(feat[5]) - 3), feat[5], tx, '0', feat[3]]
                startcodon.write("\t".join(curr_line) + "\n")
#stop_codon
def get_stop_codon(dict, dest):
    file = dest + '.stopcodon.bed'
    with open(file, 'w') as stopcodon:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    curr_line = [feat[0], str(int(feat[5]) - 3), feat[5], tx, '0', feat[3]]
                else:
                    curr_line = [feat[0], feat[4], str(int(feat[4]) + 3), tx, '0', feat[3]]
                stopcodon.write("\t".join(curr_line) + "\n")


#5-utr:
def get_five_utr(dict, dest):
    file = dest + '.5utr.bed'
    with open(file, 'w') as fiveutr:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end <= int(feat[4])
                        bool_2 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                        if bool_1:
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(i+1), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            curr_line = [feat[0], str(curr_exon_start), feat[4], tx + '_' + str(i+1), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                else:
                    index = 0
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start > int(feat[5]) and curr_exon_end > int(feat[5])
                        bool_2 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                        if bool_1:
                            index += 1
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            index += 1
                            curr_line = [feat[0], feat[5], str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
#3-utr
def get_three_utr(dict, dest):
    file = dest + '.3utr.bed'
    with open(file, 'w') as threeutr:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    index = 0
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start > int(feat[5]) and curr_exon_end > int(feat[5])
                        bool_2 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                        if bool_1:
                            index += 1
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            index += 1
                            curr_line = [feat[0], feat[5], str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                else:
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end <= int(feat[4])
                        bool_2 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                        if bool_1:
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(i+1), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            curr_line = [feat[0], str(curr_exon_start), feat[4], tx + '_' + str(i+1), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")


if __name__ == '__main__':
    print ("Processing...")
    if feature == "exon":
        get_exon(dict = tx_dict, dest = prefix)
    elif feature == "intron":
        get_intron(dict = tx_dict, dest = prefix)
    elif feature == "cds":
        get_cds(dict = tx_dict, dest = prefix)
    elif feature == "start_codon":
        get_start_codon(dict = tx_dict, dest = prefix)
    elif feature == "stop_codon":
        get_stop_codon(dict = tx_dict, dest = prefix)
    elif feature == "5utr":
        get_five_utr(dict = tx_dict, dest = prefix)
    elif feature == "3utr":
        get_three_utr(dict = tx_dict, dest = prefix)
    print ("Finished...")
使用方法:
getFeature.py --help
usage: getFeature.py [-h] --bed13 BED13 --prefix PREFIX --feature
                     {exon,intron,5utr,3utr,start_codon,stop_codon,cds}
                     [--keep] [--version]

This is a python tool for you to extract gene features in bed file format from bed13 file.

About how to convert GTF annotation file into bed13 file:
- https://www.jianshu.com/p/de2455a8f507
About the structure of transcripts:
- https://www.jianshu.com/p/f97935ce7255

optional arguments:
  -h, --help            show this help message and exit
  --bed13 BED13, -b BED13
                        the bed13 file.
  --prefix PREFIX, -p PREFIX
                        the prefix of the file to store the result.
  --feature {exon,intron,5utr,3utr,start_codon,stop_codon,cds}, -f {exon,intron,5utr,3utr,start_codon,stop_codon,cds}
                        the feature type you want to extract.
  --keep, -k            whether to keep non-main chr, default is not.
  --version, -v         show program's version number and exit

Owner: skm@smail.nju.edu.cn, Nanjing University.

感兴趣的小伙伴可以在 我的Github 上下载 getFeature.py或复制粘贴上面的code进行使(试)用。

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,445评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,889评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,047评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,760评论 1 276
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,745评论 5 367
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,638评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,011评论 3 398
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,669评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,923评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,655评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,740评论 1 330
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,406评论 4 320
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,995评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,961评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,197评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,023评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,483评论 2 342

推荐阅读更多精彩内容