Rosalind Exercise 18

# Problem

# Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

# An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

# Given: A DNA string s of length at most 1 kbp in FASTA format.

# Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

# Sample Dataset

# >Rosalind_99

# AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

# Sample Output

# MLLGSFRLIPKETLIQVAGSSPCNLS

# M

# MGMTPRLGLESLLE

# MTPRLGLESLLE

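Given a DNA sequence s of at most 1 kbp (in FASTA format), find every protein sequence that can be translated from the open reading frames (ORFs) of s. An ORF starts at the start codon (ATG) and ends at a stop codon (TAA, TAG, or TGA), with no other stop codon in between.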
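To make the "six reading frames" in the problem statement concrete, here is a minimal sketch in plain Python that splits a DNA string into its six codon lists. The helper names `reverse_complement` and `six_reading_frames` are illustrative assumptions, not part of the original solution, and the input is assumed to contain only the uppercase letters A, C, G, and T.

```python
# Illustrative helpers (hypothetical names, not from the original post).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_reading_frames(dna):
    """Return six codon lists: offsets 0, 1, 2 on each strand."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frame = strand[offset:]
            # Drop trailing bases that do not fill a whole codon.
            frame = frame[:len(frame) - len(frame) % 3]
            frames.append([frame[i:i + 3] for i in range(0, len(frame), 3)])
    return frames


if __name__ == "__main__":
    for codons in six_reading_frames("ATGAAATAG"):
        print(" ".join(codons))
```

The first three lists come from the string itself and the last three from its reverse complement, which matches the description in the problem statement.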

from Bio.Seq import Seq

# DNA sequence from the sample dataset (hard-coded here rather than read from a FASTA file)

dna_seq = Seq("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")

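# Translate a DNA sequence into a protein string, stopping at the first stop codon
def translate(dna_seq):
    protein_seq = ""
    # Walk the sequence one codon (three bases) at a time
    for i in range(0, len(dna_seq) - 2, 3):
        codon = dna_seq[i:i + 3]
        # table=1 is the standard genetic code; convert the result to a plain str
        aa = str(codon.translate(table=1))
        if aa == "*":
            break
        protein_seq += aa
    return protein_seq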

# Set that collects the distinct candidate protein strings

orf_set = set()

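# Search the forward strand and its reverse complement for ORFs
for seq in [dna_seq, dna_seq.reverse_complement()]:
    for i in range(len(seq) - 2):
        # Every ATG, in any of the three frames, opens a candidate ORF
        if seq[i:i + 3] == "ATG":
            for j in range(i + 3, len(seq) - 2, 3):
                # The first in-frame stop codon closes the ORF
                if str(seq[j:j + 3]) in {"TAA", "TAG", "TGA"}:
                    orf = seq[i:j + 3]
                    protein = translate(orf)
                    if protein:
                        orf_set.add(str(protein))
                    # Later stop codons would yield the same protein, so stop here
                    break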

# Print every distinct candidate protein string, longest first
for protein in sorted(orf_set, key=len, reverse=True):
    print(protein)
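As a cross-check that does not depend on Biopython, the same search can be sketched in plain Python. The names `CODON_TABLE`, `reverse_complement`, and `candidate_proteins` are illustrative assumptions, not part of the original solution. The codon table is the standard genetic code, built from the usual T, C, A, G ordering, and the input is assumed to contain only uppercase A, C, G, and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def candidate_proteins(dna):
    """Collect the distinct proteins of all ORFs on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            residues = []
            # Translate codon by codon until the first in-frame stop codon.
            for pos in range(start, len(strand) - 2, 3):
                aa = CODON_TABLE[strand[pos:pos + 3]]
                if aa == "*":
                    proteins.add("".join(residues))
                    break
                residues.append(aa)
            # If the strand ends before a stop codon, nothing is recorded.
    return proteins


if __name__ == "__main__":
    sample = (
        "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT"
        "TTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"
    )
    for protein in sorted(candidate_proteins(sample), key=len, reverse=True):
        print(protein)
```

On the sample dataset this sketch is expected to yield the four strings listed under Sample Output. Building each protein while scanning means every start codon is translated exactly once, and an ATG with no downstream in-frame stop codon contributes nothing.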
