提取最长转录本是一个经常需要做的工作,本脚本会产生三个文件:.fa.max(所需结果);.len
(提取序列长度);.len.list(基因列表)。
本文特点批量操作!!!!
我这次需要达到的目标是提取每个物种的最长序列!也适用于提取最长转录本!!!!
示例文件:
>Aco|Aqcoe2G366900.1
MDGAATPMEVNDDDISTVEKTVRIEDVHSKSKDSIVLIISPRFIQPKNVDPKFIDIIANIYLVTRRVFNLRVEEISERVS
TEDVVGSEAKEACTICMDELSLDDRPVLTLPCSHDFHFLCLSKWIKDHNSCPLC
>Aco|Aqcoe2G002600.1
MSSSGNTHWCYSCRRPVRLRGRDAVCPYCSGGFVQELNEVEGISPLDFFGIDSDDERDQRFGVMEAFSALMRQRLAGRNRELDIRGRSGVVPDHGPGVNSGPWLIFSGQIPVHMSENGGIEVLLNGGPTVGLRRANVSDYFVGPGLDELIEQLTRNDRRGPPPAAQSAIDSMPTVKISHKHLRTDSHCPVCKERFELGTEARQMPCNHMYHSDCIVPWLVQHNSCPVCRLELPPQGSTSAHSSQSSSGGSRSGTASSNTSGRENSSESQTRRHPLSFLWPFRSSNTSTNTTNSNSNSNRSEPAASASAPLHEDNHQMHYTGWPFDY*
>Aco|Aqcoe2G351800.1
MAGMLPGVELARRRRTNHHHHHHHHQFEFTSNSREPTTLHHHQRLEPSSSMDETALMARKRLEEKLRGFSSAPSRWSKQQSLKGDKSYTVAYVKETHASKRQGKQIVQLDRRNSQREVCSICLEDFQAQQQVMNLPCCHRYHSDCLLPWLSAHSHCPYCRTKVMSSAN*
>Aco|Aqcoe2G264300.1
MSNDEHETSFTTICCVGREAFWCEDKVENVFVESEVSIDIYITLVDQWITSPQEEDYSDVDPPVLLRQISTICSEVAAIIMQEKPKNISVNIVVEFVHTVLCDVQAFNESFMDEEVDPILVPAAPSYVEALKTKQYDKENSKEESCSVCYQEFLMAEDVTLMPCSHIFHNNCIGRWLEMTNTCPMCRFTMPTFV*
>Aco|Aqcoe2G072200.1
MIDSSYSELEHEIAKHHLLYFLSFTEKGSNPIAYLGNIAWLAKQRAEETGGDYESIWLETFNSISSILIQEEAKTKLHDVVVLDGDEDDMICAICHEEFEVGFRGFGNTAMDCGHMFHRNCIYEWFKNKPNCPVCRHERKV*
目标提取Aco|Aqcoe2G002600.1
使用嵌套的字典实现,代码如下:
import re
import sys
file =open("/home/data2/EvoGenome/xialf/orthfinder/OrthoFinder/Results_Apr20/24group.txt","r")
for g in file:
t =g.replace("\n","")
filename = "/home/data2/EvoGenome/xialf/orthfinder/OrthoFinder/Results_Apr20/24group/%s.fa"%t
seqlist={}
i=0
with open(filename,'r')as r:
lines=r.readlines()
for l in lines:
i+=1
linecontent =l.strip()
if linecontent.startswith('>'):
if i > 1:
name = linecontent[1:]
id =name[0:3]#此处是重点!!!
if id not in seqlist.keys():
seqlist[id]={}
seqlist[id][name]=""
elif i==1:
name =linecontent[1:]
id =name[0:3]#此处是重点!!!
seqlist[id]={}
seqlist[id][name]=""
else:
seqlist[id][name]=seqlist[id][name]+linecontent
maxseq ={}
for k,v in seqlist.items():
d=0
for k1,v1 in v.items():
d +=1
if d ==1:
maxseq[k1]=v1
maxlen =len(v1)
k_last =k1
else:
if len(v1) > maxlen:
maxlen=len(v1)
maxseq[k1]=v1
del maxseq[k_last]
k_last =k1
result1=filename+".max"
result2=filename+".max.list"
result3=filename+".max.len"
with open(result1,'w')as w1:
with open(result2,'w')as w2:
with open(result3,'w')as w3:
for key,value in maxseq.items():
w1.write(">"+key+'\n'+value+"\n")
w2.write(key+'\n')
w3.write(key+'\t'+str(len(value))+"\n")
第一篇文章完成,嘿嘿。