# Problem
# Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
# An open reading frame (ORF) is a reading frame that starts at a start codon and ends at a stop codon, with no other stop codon in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
# Given: A DNA string s of length at most 1 kbp in FASTA format.
# Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
# Sample Dataset
# >Rosalind_99
# AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
# Sample Output
# MLLGSFRLIPKETLIQVAGSSPCNLS
# M
# MGMTPRLGLESLLE
# MTPRLGLESLLE
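# Summary: given a DNA sequence s of at most 1 kbp (FASTA format), find the protein strings translated from every open reading frame (ORF) of s. An ORF starts at the start codon (ATG) and ends at a stop codon (TAA, TAG, or TGA), with no other stop codon in between.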
from Bio.Seq import Seq
# DNA sequence from the sample dataset (hardcoded here rather than read from a FASTA file)
dna_seq = Seq("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")
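# A minimal sketch, not part of the original solution: the sequence above is
# hardcoded, whereas the problem supplies it in FASTA format. The helper below
# (hypothetical name `read_fasta_sequence`) shows one way to pull the first
# record's sequence out of FASTA text using only string operations.
def read_fasta_sequence(fasta_text):
    sequence_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record; stop here
            seen_header = True
        elif seen_header and line:
            sequence_lines.append(line)  # sequence may span several lines
    return "".join(sequence_lines)
# Possible usage, assuming an input file named "rosalind_orf.txt" (hypothetical):
# dna_seq = Seq(read_fasta_sequence(open("rosalind_orf.txt").read()))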
# Translate a DNA sequence codon by codon, stopping at the first stop codon
def translate(dna_seq):
    protein_seq = ""
    for i in range(0, len(dna_seq) - 2, 3):
        codon = dna_seq[i:i + 3]
        aa = str(codon.translate(table=1))  # table 1 is the standard genetic code
        if aa == "*":
            break
        protein_seq += aa
    return protein_seq
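# Initialize the set of distinct protein strings
orf_set = set()
# Search both the forward strand and its reverse complement for ORFs
for seq in [dna_seq, dna_seq.reverse_complement()]:
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG":
            # Scan in frame for the first stop codon; an ATG with no downstream stop is not an ORF
            for j in range(i + 3, len(seq) - 2, 3):
                if str(seq[j:j + 3]) in {"TAA", "TAG", "TGA"}:
                    orf = seq[i:j + 3]
                    orf_set.add(translate(orf))
                    break  # later stops would yield the same protein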
# Print every distinct protein string, longest first
for protein in sorted(orf_set, key=len, reverse=True):
    print(protein)