# Problem
# For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring.
# By the assumption of parsimony, a shortest possible superstring over a collection of reads serves as a candidate chromosome.
# Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.
# Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).
# Sample Dataset
# >Rosalind_56
# ATTAGACCTG
# >Rosalind_57
# CCTGCCGGAA
# >Rosalind_58
# AGACCTGCCG
# >Rosalind_59
# GCCGGAATAC
# Sample Output
# ATTAGACCTGCCGGAATAC
# 这道题是要我们将给定的DNA序列拼接起来,得到一个包含所有序列的最短超级字符串。可用贪心算法。
# 计算字符串之间的重叠部分长度
def overlap(s, t):
max_len =min(len(s), len(t))
for iin range(max_len, 0, -1):
if s[-i:] == t[:i]:
return i
return 0
# 查找下一个要拼接的字符串
def find_next_string(current_string, unused_strings):
overlap_lengths = [overlap(current_string, s)for sin unused_strings]
next_index = overlap_lengths.index(max(overlap_lengths))
return unused_strings.pop(next_index)
# 使用贪心算法拼接字符串
def shortest_superstring(strings):
superstring = strings.pop(0)
while strings:
next_string = find_next_string(superstring, strings)
overlap_len = overlap(superstring, next_string)
superstring += next_string[overlap_len:]
return superstring
# 读取文件
filename ="input7.txt"
with open(filename, 'r')as f:
lines = [line.strip()for linein f.readlines()]
# 提取字符串列表
strings = []
current_string =''
for linein lines:
if line.startswith('>'):
strings.append(current_string)
current_string =''
else:
current_string += line
strings.append(current_string)
# 寻找最短的超级字符串
superstring = shortest_superstring(strings)
# 输出结果
print(superstring)