提取最长转录本是一个常见的功能,本脚本会生产两个文件maxpep文件为仅含有最长转录本的fasta文件,maxpep.list为最长转录本的列表
原文件格式如下,这是其中一个基因:
>OsZS_01T0000300.1
MSSAAGQDNGDTAGDYIKWMCGAGGRAGGAMANLQRGVGSLVRDIGDPCLNPSPVKGSKM
LKPEKWHTCFDNDGKVIGFRKALKFIVLGGVDPTIRAEVWEFLLGCYALSSTSEYRRKLR
AVRREKYQILVRQCQSMHPSIGTEVNNCSKDSQDYNDMGEPRYDTETFDDYPSLPVTNFF
STDGVGSNGVDKNHCSFSVPEDRLRHRDERMHSFQINNNIDLIIESNSCSSDVFRASNSD
SAIFHSDAYKQDRWLDDNGYNREVIDSLRISDAPEADFVDGTKSNSVVASKDRVSEWLWT
LHRIVVDVVRTDSHLDFYGESRNMARMSDILAVYAWVDPSTGYCQGMSDLLSPFVVLYED
DADAFWCFEMLLRRMRENFQMEGPTGVMKQLQALWKIMEITDVELFEHLSTIGAESLHFA
FRMLLVLFRRELSFEESLSMWEVWHLKTHMMWAADFNEDVILHLEENCLEPLLVDMRNDL
SCEVKEEHRVNSYTRRKSKSRKPHHRNGEMRVACNLGMKPNTRNPLCGLSGATIWARHQQ
MPHISTNVLAKNGDDDLPIFCVAAILVINRHKIIRETRSIDDAIKASLSHNFLSYCLQTS
DPWNYYQELPIFALDFMFNDNMLKINVKRCVRMAIKLRKKYIYKLLKGGSE*
>OsZS_01T0000300.2
MSSAAGQDNGDTAGDYIKWMCGAGGRAGGAMANLQRGVGSLVRDIGDPCLNPSPVKGSKM
LKPEKWHTCFDNDGKVIGFRKALKFIVLGGVDPTIRAEVWEFLLGCYALSSTSEYRRKLR
AVRREKYQILVRQCQSMHPSIGTGELAYAVGSKLMDVRTMSKETHIAEEVSTSQQTSQNT
AGSLVEDSDYGPGGAQQSQKRESCSKSAELVGFNVHNDTSLYDSSNFIVSSTEVNNCSKD
SQDYNDMGEPRYDTETFDDYPSLPVTNFFSTDGVGSNGVDKNHCSFSVPEDRLRHRDERM
HSFQINNNIDLIIESNSCSSDVFRASNSDSAIFHSDAYKQDRWLDDNGYNREVIDSLRIS
DAPEADFVDGTKSNSVVASKDRVSEWLWTLHRIVVDVVRTDSHLDFYGESRNMARMSDIL
AVYAWVDPSTGYCQGMSDLLSPFVVLYEDDADAFWCFEMLLRRMRENFQMEGPTGVMKQL
QALWKIMEITDVELFEHLSTIGAESLHFAFRMLLVLFRRELSFEESLSMWEVWHLKTHMM
WAADFNEDVILHLEENCLEPLLVDMRNDLSCEVKEEHRVNSYTRRKSKSRKPHHRNGEMR
VACNLGMKPNTRNPLCGLSGATIWARHQQMPHISTNVLAKNGDDDLPIFCVAAILVINRH
KIIRETRSIDDAIKASLSHNFLSYCLQTSDPWNYYQELPIFALDFMFNDNMLKINVKRCV
RMAIKLRKKYIYKLLKGGSE*
>OsZS_01T0000300.3
MSSAAGQDNGDTAGDYIKWMCGAGGRAGGAMANLQRGVGSLVRDIGDPCLNPSPVKGSKM
LKPEKWHTCFDNDGKVIGFRKALKFIVLGGVDPTIRAEVWEFLLGCYALSSTSEYRRKLR
AVRREKYQILVRQCQSMHPSIGTGELAYAVGSKLMDVRTMSKETHIAEEVSTSQQTSQNT
AGSLVEDSDYGPGGAQQSQKRESCSKSAELVGFNVHNDTSLYDSSNFIVSSTEVNNCSKD
SQDYNDMGEPRYDTETFDDYPSLPVTNFFSTDGVGSNGVDKNHCSFSVPEDRLRHRDERM
HSFQINNNIDLIIESNSCSSDVFRASNSDSAIFHSDAYKQDRWLDDNGYNREVIDSLRIS
DAPEADFVDGTKSNSVVASKDRVSEWLWTLHRIVVDVVRTDSHLDFYGESRNMARMSDIL
AVYAWVDPSTGYCQGMSDLLSPFVVLYEDDADAFWCFEMLLRRMRENFQMEGPTGVMKQL
QALWKIMEITDVELFEHLSTIGAESLHFAFRMLLVLFRRELSFEESLSMWEVWHLKTHMM
WAADFNEDVILHLEENCLEPLLVDMRNDLSCEMFNDNMLKINVKRCVRMAIKLRKKYIYK
LLKGGSE*
观察可得:OsZS_01T0000300这个基因在这里有三个转录本,我们的目标是只剩下最长的转录本OsZS_01T0000300.2,结果如下:
>OsZS_01T0000300.2
MSSAAGQDNGDTAGDYIKWMCGAGGRAGGAMANLQRGVGSLVRDIGDPCLNPSPVKGSKM
LKPEKWHTCFDNDGKVIGFRKALKFIVLGGVDPTIRAEVWEFLLGCYALSSTSEYRRKLR
AVRREKYQILVRQCQSMHPSIGTGELAYAVGSKLMDVRTMSKETHIAEEVSTSQQTSQNT
AGSLVEDSDYGPGGAQQSQKRESCSKSAELVGFNVHNDTSLYDSSNFIVSSTEVNNCSKD
SQDYNDMGEPRYDTETFDDYPSLPVTNFFSTDGVGSNGVDKNHCSFSVPEDRLRHRDERM
HSFQINNNIDLIIESNSCSSDVFRASNSDSAIFHSDAYKQDRWLDDNGYNREVIDSLRIS
DAPEADFVDGTKSNSVVASKDRVSEWLWTLHRIVVDVVRTDSHLDFYGESRNMARMSDIL
AVYAWVDPSTGYCQGMSDLLSPFVVLYEDDADAFWCFEMLLRRMRENFQMEGPTGVMKQL
QALWKIMEITDVELFEHLSTIGAESLHFAFRMLLVLFRRELSFEESLSMWEVWHLKTHMM
WAADFNEDVILHLEENCLEPLLVDMRNDLSCEVKEEHRVNSYTRRKSKSRKPHHRNGEMR
VACNLGMKPNTRNPLCGLSGATIWARHQQMPHISTNVLAKNGDDDLPIFCVAAILVINRH
KIIRETRSIDDAIKASLSHNFLSYCLQTSDPWNYYQELPIFALDFMFNDNMLKINVKRCV
RMAIKLRKKYIYKLLKGGSE*
使用嵌套的字典实现,也就是生成
seqlist = {"OsZS_01T0000300": {"OsZS_01T0000300.1":seq1, "OsZS_01T0000300.2": seq2, "OsZS_01T0000300.3": seq3}}这样的字典.再比较seqlist["OsZS_01T0000300"] 内最长的seq是哪一条,代码如下:
import sys
import re
filename = 'F:/biosoft/ATA/ATA.pep'
seqlist = {}
i = 0
with open(filename,'r') as r:
lines=r.readlines()
for l in lines:
i += 1
linecontent = l.strip()
if linecontent.startswith(">"):
if i > 1:
name = linecontent[1:]
id = name[0: name.rfind(".",1)]
if id not in seqlist.keys():
seqlist[id] = {}
seqlist[id][name] = ""
elif i == 1:
name = linecontent[1:]
id = name[0: name.rfind(".",1)]
seqlist[id] = {}
seqlist[id][name] = ""
else:
seqlist[id][name] = seqlist[id][name] + linecontent
maxseq = {}
for k,v in seqlist.items():
d=0
for k1,v1 in v.items():
d +=1
if d ==1:
maxseq[k1] = v1
maxlen = len(v1)
k_last = k1
else:
if len(v1) > maxlen:
maxseq[k1] = v1
del maxseq[k_last]
k_last = k1
#生成结果
result1 = filename + ".max"
result2 = filename + ".max.list"
result3 = filename + ".max.len"
with open(result1,'w') as w1:
with open(result2,'w') as w2:
with open(result3,'w') as w3:
for key,value in maxseq.items():
w1.write(">" + key + "\n" + value + "\n")
w2.write(key + "\n")
w3.write(key + "\t" + str(len(value)) +"\n")