Rosalind Exercise 18

# Problem

# Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

# An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

# Given: A DNA string s of length at most 1 kbp in FASTA format.

# Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

# Sample Dataset

# >Rosalind_99

# AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

# Sample Output

# MLLGSFRLIPKETLIQVAGSSPCNLS

# M

# MGMTPRLGLESLLE

# MTPRLGLESLLE

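Given a DNA sequence s of at most 1 kbp (in FASTA format), find every protein sequence that can be translated from the open reading frames (ORFs) of s. An open reading frame starts at a start codon (ATG) and ends at a stop codon (TAA, TAG, or TGA), with no other stop codon in between.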
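To make the "six reading frames" idea concrete, here is a minimal pure-Python sketch. The helper names `reverse_complement` and `six_frames` are illustrative assumptions, not part of the Rosalind problem or of Biopython, and the input is assumed to contain only uppercase A, C, G, and T.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the six reading frames of dna, each trimmed to whole codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frame = strand[offset:]
            usable = len(frame) - len(frame) % 3  # drop a trailing partial codon
            frames.append(frame[:usable])
    return frames


for frame in six_frames("ATGAAATAG"):
    # Split each frame into codons for display
    print(" ".join(frame[i:i + 3] for i in range(0, len(frame), 3)))
```

The first three frames come from the string itself and the last three from its reverse complement, which is why an ORF search has to scan both strands.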

from Bio.Seq import Seq

# DNA sequence from the sample dataset (hard-coded here rather than read from a FASTA file)

dna_seq = Seq("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")

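# Translate a DNA sequence into a protein string, stopping at the first stop codon
def translate(dna_seq):
    protein_seq = ""
    for i in range(0, len(dna_seq) - 2, 3):
        codon = dna_seq[i:i + 3]
        # Convert to str so the result is a plain string rather than a Seq object
        aa = str(codon.translate(table=1))
        if aa == "*":
            break
        protein_seq += aa
    return protein_seq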

# Set that collects the distinct candidate protein strings

orf_set = set()

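# Search both the forward strand and its reverse complement for ORFs
for seq in [dna_seq, dna_seq.reverse_complement()]:
    for i in range(len(seq)):
        # An ORF starts at a start codon (ATG)
        if str(seq[i:i + 3]) == "ATG":
            # Walk the same reading frame codon by codon
            for j in range(i + 3, len(seq), 3):
                # Compare plain strings so set membership does not depend on how Seq objects hash
                if str(seq[j:j + 3]) in {"TAA", "TAG", "TGA"}:
                    orf = seq[i:j + 3]
                    protein = translate(orf)
                    if protein:
                        orf_set.add(protein)
                    # The ORF ends at the first in-frame stop codon
                    break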

# Print every protein string in the result set, longest first
for orf in sorted(orf_set, key=len, reverse=True):
    print(orf)
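The script above hard-codes the sample sequence and relies on Biopython. As a rough alternative, the sketch below parses FASTA text and finds the candidate proteins in pure Python. The names `parse_fasta`, `reverse_complement`, and `candidate_proteins` are hypothetical helpers introduced here for illustration. It assumes the standard genetic code, a single FASTA record, and a sequence made up only of A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    sequence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if sequence:  # a second header: stop after the first record
                break
            continue
        sequence.append(line.upper())
    return "".join(sequence)


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_proteins(dna):
    """Return the set of distinct proteins translated from ORFs on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            # Translate codon by codon within the frame fixed by the start codon
            for pos in range(start, len(strand) - 2, 3):
                aa = CODON_TABLE[strand[pos:pos + 3]]
                if aa == "*":
                    # Only keep the protein if a stop codon actually closes the ORF
                    proteins.add("".join(protein))
                    break
                protein.append(aa)
    return proteins


SAMPLE = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
"""

for protein in sorted(candidate_proteins(parse_fasta(SAMPLE)), key=len, reverse=True):
    print(protein)
```

A start codon that never reaches an in-frame stop codon contributes nothing, which matches the definition of an ORF in the problem statement. To run this on a downloaded dataset, the text of the file can be passed to `parse_fasta` in place of `SAMPLE`.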
