rosalind是一个生信学习网站(网址为[http://rosalind.info/problems/list-view/]),上面有许多生物信息相关的题目,用python解决它。这些代码是我个人学习时写的,也有些是在网上看到的(比如生信技能树,简书等)。每道题可能有多种解法,我以#1,#2,#3表示。(注意一下代码对齐问题)
rosalind程序问题解决
1. Counting DNA Nucleotides 碱基记数
``` python
#!/usr/bin/env python3
# counting DNA nucleotides##1
cnts = {c:0 for c in 'ATCG'}
for line in open("dna nuclear.txt",'r'):
for c in line.rstrip():
cnts[c] += 1
print(cnts)
#2
with open('dna nuclear.txt', 'r') as f:
data = f.read().strip('\n')
#count() 方法用于统计字符串里某个字符出现的次数。
#str.count(sub, start= 0,end=len(string))
list = ([data.count(c) for c in 'ACGT'])
for i in list:
print(i, end= " ")
#3 用字典的方法
dic={}
with open('dna nuclear.txt', 'r') as f:
data = f.read().strip('\n')
for base in data:
# get() 函数返回指定键的值,如果值不在字典中返回默认值。
# dict.get(key, default=None)
dic[base]=1+ dic.get(base,0)
for key in sorted(dic.keys()):
print(dic[key], end =" ")
```
2.Transcribing DNA into RNA 将DNA转为RNA。
(即用"U"替换"T")
#1
with open('rosalind_rna.txt', 'r') as f:
data = f.read().strip('\n')
print(data.replace("T", "U"))
#2
with open('rosalind_rna.txt','r') as f:
for line in f:
line = line.upper()
RnaSeq = re.sub('T','U',line.rstrip())
print(RnaSeq)
3.Complementing a Strand of DNA DNA互补链
#1
##利用大小写的不同,非常巧妙!!!
with open('rosalind_revc.txt', 'r') as f:
data = f.read().strip('\n')
st = data.replace('A', 't').replace('T', 'a').replace('C', 'g').replace('G', 'c').upper()[::-1]
print(st)
#2
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('rosalind_revc.txt','r') as f:
for line in f:
seq = ''
line = line.upper()
for aa in line.rstrip():
seq += trans.get(aa)
print(seq[::-1])
#3
def reverse_complement(seq):
ntComplement = {'A':'T','T':'A','G':'C','C':'G'}
RevSeqList = list(reversed(seq))
RevComSeqList = [ntComplement[k] for k in RevSeqList]
RevComSeq = ''.join(RevComSeqList)
return RevComSeq
seq = ''
with open('rosalind_revc.txt','r') as f:
for line in f:
line = line.upper()
print (reverse_complement(line.rstrip()))
4. Rabbits and Recurrence Relations 兔子问题,斐波那契数列的一点改变
递归函数的使用
n=input("n is:\r") #公式为Fn = Fn-1 + k * Fn-2
k=input("k is:\r") #\n代表换行
fn =[1,1];
bool_list=[1,1,0];
for i in range(2,int(n)):
if(bool_list[i]==0):
fn.append(fn[i-1]+int(k)*fn[i-2])
bool_list[i]=1
bool_list.append(0)
i+=1;
else:
bool_list.append(0)
i+=1;
print(fn[int(n)-1])
5.Computing GC Content fasta文件中GC含量最大的序列
## 1
import re
Seq = {}
seqGC = {}
with open('./test.fa','r') as f:
for line in f:
if re.match(">",line):
SeqName = line[1:]
Seq[SeqName] = ''
seqGC[SeqName] = 0
else:
line = line.upper()
line = line.rstrip()
Seq[SeqName] += line
seqGC[SeqName] += line.count('G')
seqGC[SeqName] += line.count('C')
maxGC = 0
for key , value in Seq.items():
if maxGC < float(seqGC[key]/ len(value)*100):
maxGC = float(seqGC[key] / len(value)*100)
tmp = key
print ('>'+tmp+Seq[tmp])
print (tmp)
print(float(maxGC))
## 2
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
GcContent = OrderedDict()
with open('./test.fa','r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
SeqName = line[1:]
SeqTest[SeqName] = ''
continue
SeqTest[SeqName] += line.upper()
for key, value in SeqTest.items():
totalLength = len(value)
gcNum = value.count('G') + value.count('C')
gcContent[key] = float(gcNum/totalLength)*100
sortedGC = sorted(gcContent.items(),key = itemgetter(1))
largeName = sortedGC[-1][0]
largeGCcontent = sortedGC[-1][1]
print ('most GC ratio rate is %s and it is %s ' %(largeName,largeGCContent))
6.Counting Point Mutations 计算点突变数目
## 1
fh = open('rosalind_hamm.txt','r')
lst = []
for line in fh:
lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
if lst[0][i] == lst[1][i]:
continue
hamming_dis += 1
print (hamming_dis)
## 2
fh = open('rosalind_hamm.txt','r')
seq = file.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print (len(mutation))
7.Mendel's First Law 孟德尔第一定理
一个群体中有三种基因型的生物:k,显性纯合子;m,杂合子;n,隐性纯合子。假设这对形状由一对等位基因控制,且群体中随机选取的任何两个个体都能交配,求随机选取两个个体交配后,子代拥有显性等位基因的概率。
## 1
k =22
m = 25
n = 16
num = int(k + m + n)
choice = num*(num-1)/2.0
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)
## 2
from scipy.misc import comb
num = input("Number of individuals(k,m,n): ")
[k,m,n] = map(int,num.split(','))
t = k + m + n
rr = comb(n,2)/comb(t,2)
hh = comb(m,2)/comb(t,2)
hr = comb(n,1)*comb(m,1)/comb(t,2)
p = 1 - (rr+hh*1/4+hr*1/2)
print(p)
8. Translating RNA into Protein 将RNA翻译成蛋白质
def translate_rna(sequence):
codonTable = {
'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
}
proteinsequence = ''
for n in range(0,len(sequence),3):
if sequence[n:n+3] in codonTable.keys():
proteinsequence += codonTable[sequence[n:n+3]]
return proteinsequence
protein_fh = open('./protein.txt','w')
with open('./rna.txt','r') as f:
for line in f:
protein_fh.write(translate_rna(line.strip('\n')))
## 2
import re
from collections import OrderedDict
codonTable = OrderedDict()
codonTable={
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}
rnaseq = ''
with open('./rna.txt','r') as f:
for line in f:
line = line.rstrip()
line += line.upper()
aminoAcids = []
i = 0
while i < len(rnaseq):
condon = rnaseq[i:i+3]
if codonTable[condon] != '':
aminoAcids.append(codonTable[condon])
i += 3
peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna,generic_rna
# translate
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
print(rna.translate())
9. Finding a Motif in DNA 寻找DNA motif
## 1
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
for i in range(len(seq)-motif_len):
if seq[i:i+motif_len] == motif:
position.append(i+1)
print(position)
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])
10. Consensus and Profile 多个等长序列的一致性序列
比如序列如下:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
各位点碱基个数:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Consensus A T G C A A C T
## 1
def seq_list(fasta):
seq_list = []
for line in fasta.readlines():
if not line.startswith('>'):
seq = list(line.rstrip())
seq_list.append(seq)
return seq_list
def statistic_base(seq_list):
for base in 'ATGC':
base_total = []
for sit in range(len(seq_list[0])):
col = [x[sit] for x in seq_list]
num = col.count(base)
base_total.append(num)
print('%s:%s'%(base,base_total))
fh = open('./test.fa','r')
sequence_list = seq_list(fh)
statistic_base(sequence_list)
## 2
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt','wt')
with open('./test.fa','r') as f:
for line in f:
if line.startswith('>'):
seq_name = line.rstrip()
seq[seq_name] = ''
continue
seq[seq_name] += line.upper().rstrip()
seqLength = len(seq[seq_name])
a,t,g,c = [],[],[],[]
consensus = ''
for i in range(seqLength):
sequence = ''
for j in seq.keys():
sequence += seq[j][i]
a.append(sequence.count('A'))
t.append(sequence.count('T'))
g.append(sequence.count('G'))
c.append(sequence.count('C'))
counts = Counter(sequence)
consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()
11. Mortal Fibonacci Rabbits
斐波那契序列是一个序列的数字定义的递归关系Fn = Fn-1+ Fn−2 ,我们设置的起始值F1 = F2 = 1。
假设每只兔子可以活m个月,n个月后有多少只兔子?
## 1
def fib(n,m):
f= [0,1,1]
for i in range(3,n+1):
if i <= m:
total = f[i-1] + f[i-2]
elif i == m+1:
total = f[i-1] + f[i-2] - 1
else:
total = f[i-1] + f[i-2] - f[i-m-1]
f.append(total)
return(f[n])
inp = input('live month of rabbit(m),and afther n-th month;n<=100,m<=20;input(n,m): ')
[n,m]=map(int,inp.split(','))
print(fib(n,m))
12. Overlap Graphs Graph Theory
文件介绍好麻烦,自己看:http://rosalind.info/problems/grph/
总之该题有三个碱基的首尾相同就连接起来,
输入文件:
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
输出结果:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
seq = {}
with open('./overlap.fa','r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
seqname = line[1:]
seq[seqname] = ''
continue
seq[seqname] += line.upper()
for key , value in seq.items():
for key2 ,value2 in seq.items():
if key != key2 and value[-3:] == value2[:3]:
print(key+'\t'+key2)
13. Calculating Expected Offspring 计算后代的期望值
同样懒得解释原理,具体原理看:http://rosalind.info/problems/iev/
现在有6种基因型组合夫妇:
AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa
给定6个非负整数,代表6种基因型组合的夫妇数量,求下一代显性性状的个数,假设每对夫妻有2个孩子。
def expected(a,b,c,d,f,g):
AA_AA = 1
AA_Aa = 1
AA_aa = 1
Aa_Aa = 0.75
Aa_aa = 0.5
aa_aa = 0
p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
return (p)
inp = input('input(a,b,c,d,f,g): ')
[a,b,c,d,f,g] = map(int,inp.split(','))
print(expected(a,b,c,d,f,g))
14. Finding a Shared Motif
#待填坑
##15\.Independent Alleles
## 16\. Finding a Protein Motif
##17\.Inferring mRNA from Protein
## 18\.Open Reading Frames
##19\.Enumerating Gene Orders
PRTM Calculating Protein Mass
REVP Locating Restriction Sites
SPLC RNA Splicing
LEXF Enumerating k-mers Lexicographically
LGIS Longest Increasing Subsequence
LONG Genome Assembly as Shortest Superstring
PMCH Perfect Matchings and RNA Secondary Structures 2053
PPER Partial Permutations 2886
PROB Introduction to Random Strings 2821
SIGN Enumerating Oriented Gene Orderings 2924
SSEQ Finding a Spliced Motif 3123
TRAN Transitions and Transversions 2988
TREE Completing a Tree 2561
CAT Catalan Numbers and RNA Secondary Structures 866
CORR Error Correction in Reads 1411
INOD Counting Phylogenetic Ancestors 1901
KMER k-Mer Composition 2135
KMP Speeding Up Motif Finding 1714
LCSQ Finding a Shared Spliced Motif 1468
LEXV Ordering Strings of Varying Length Lexicographically
LCSQ Finding a Shared Spliced Motif 1468
LEXV Ordering Strings of Varying Length Lexicographically 2393
MMCH Maximum Matchings and RNA Secondary Structures 1044
PDST Creating a Distance Matrix 1543
REAR Reversal Distance 767
RSTR Matching Random Motifs 1174
SSET Counting Subsets 1808
ASPC Introduction to Alternative Splicing 1142
EDIT Edit Distance 1097
EVAL Expected Number of Restriction Sites 892
MOTZ Motzkin Numbers and RNA Secondary Structures 560
NWCK Distances in Trees 699
SCSP Interleaving Two Motifs 676
SETO Introduction to Set Operations 1440
SORT Sorting by Reversals 588
SPEC Inferring Protein from Spectrum 1120
TRIE Introduction to Pattern Matching 868
CONV Comparing Spectra with the Spectral Convolution 722
CTBL Creating a Character Table 404
DBRU Constructing a De Bruijn Graph 738
EDTA Edit Distance Alignment 716
FULL Inferring Peptide from Full Spectrum 508
INDC Independent Segregation of Chromosomes 582
ITWV Finding Disjoint Motifs in a Gene 274
LREP Finding the Longest Multiple Repeat 400
NKEW Newick Format with Edge Weights 436
RNAS Wobble Bonding and RNA Secondary Structures 395
AFRQ Counting Disease Carriers 494
CSTR Creating a Character Table from Genetic Strings 257
CTEA Counting Optimal Alignments
| UNR | [Counting Unrooted Binary Trees](http://rosalind.info/problems/cunr/) | [251](http://rosalind.info/problems/cunr/recent/) | | | | |
| GLOB | [Global Alignment with Scoring Matrix](http://rosalind.info/problems/glob/) | [510](http://rosalind.info/problems/glob/recent/) | | | | |
| PCOV | [Genome Assembly with Perfect Coverage](http://rosalind.info/problems/pcov/) | [528](http://rosalind.info/problems/pcov/recent/) | | | | |
| PRSM | [Matching a Spectrum to a Protein](http://rosalind.info/problems/prsm/) | [390](http://rosalind.info/problems/prsm/recent/) | | | | |
| QRT | [Quartets](http://rosalind.info/problems/qrt/) | [208](http://rosalind.info/problems/qrt/recent/) | | | | |
| SGRA | [Using the Spectrum Graph to Infer Peptides](http://rosalind.info/problems/sgra/) | [336](http://rosalind.info/problems/sgra/recent/) | | | | |
| SUFF | [Encoding Suffix Trees](http://rosalind.info/problems/suff/) | [289](http://rosalind.info/problems/suff/recent/) | | | | |
| CHBP | [Character-Based Phylogeny](http://rosalind.info/problems/chbp/) | [139](http://rosalind.info/problems/chbp/recent/) | | | | |
| CNTQ | [Counting Quartets](http://rosalind.info/problems/cntq/) | [147](http://rosalind.info/problems/cntq/recent/) | | | | |
| EUBT | [Enumerating Unrooted Binary Trees](http://rosalind.info/problems/eubt/) | [135](http://rosalind.info/problems/eubt/recent/) | | | | |
| GASM | [Genome Assembly Using Reads](http://rosalind.info/problems/gasm/) | [289](http://rosalind.info/problems/gasm/recent/) | | | | |
| GCON | [Global Alignment with Constant Gap Penalty](http://rosalind.info/problems/gcon/) | [331](http://rosalind.info/problems/gcon/recent/) | | | | |
| LING | [Linguistic Complexity of a Genome](http://rosalind.info/problems/ling/) | [174](http://rosalind.info/problems/ling/recent/) | | | | |
| LOCA | [Local Alignment with Scoring Matrix](http://rosalind.info/problems/loca/) | [344](http://rosalind.info/problems/loca/recent/) | | | | |
| MEND | [Inferring Genotype from a Pedigree](http://rosalind.info/problems/mend/) | [225](http://rosalind.info/problems/mend/recent/) | | | | |
| MGAP | [Maximizing the Gap Symbols of an Optimal Alignment](http://rosalind.info/problems/mgap/) | [182](http://rosalind.info/problems/mgap/recent/) | | | | |
| MREP | [Identifying Maximal Repeats](http://rosalind.info/problems/mrep/) | [155](http://rosalind.info/problems/mrep/recent/) | | | | |
| MULT | [Multiple Alignment](http://rosalind.info/problems/mult/) | [185](http://rosalind.info/problems/mult/recent/) | | | | |
| PDPL | [Creating a Restriction Map](http://rosalind.info/problems/pdpl/) | [203](http://rosalind.info/problems/pdpl/recent/) | | | | |
| ROOT | [Counting Rooted Binary Trees](http://rosalind.info/problems/root/) | [215](http://rosalind.info/problems/root/recent/) | | | | |
| SEXL | [Sex-Linked Inheritance](http://rosalind.info/problems/sexl/) | [378](http://rosalind.info/problems/sexl/recent/) | | | | |
| SPTD | [Phylogeny Comparison with Split Distance](http://rosalind.info/problems/sptd/) | [152](http://rosalind.info/problems/sptd/recent/) | | | | |
| WFMD | [The Wright-Fisher Model of Genetic Drift](http://rosalind.info/problems/wfmd/) | [297](http://rosalind.info/problems/wfmd/recent/) | | | | |
| ALPH | [Alignment-Based Phylogeny](http://rosalind.info/problems/alph/) | [100](http://rosalind.info/problems/alph/recent/) | | | | |
| ASMQ | [Assessing Assembly Quality with N50 and N75](http://rosalind.info/problems/asmq/) | [230](http://rosalind.info/problems/asmq/recent/) | | | | |
| CSET | [Fixing an Inconsistent Character Set](http://rosalind.info/problems/cset/) | [115](http://rosalind.info/problems/cset/recent/) | | | | |
| EBIN | [Wright-Fisher's Expected Behavior](http://rosalind.info/problems/ebin/) | [248](http://rosalind.info/problems/ebin/recent/) | | | | |
| FOUN | [The Founder Effect and Genetic Drift](http://rosalind.info/problems/foun/) | [233](http://rosalind.info/problems/foun/recent/) | | | | |
| GAFF | [Global Alignment with Scoring Matrix and Affine Gap Penalty](http://rosalind.info/problems/gaff/) | [277](http://rosalind.info/problems/gaff/recent/) | | | | |
| GREP | [Genome Assembly with Perfect Coverage and Repeats](http://rosalind.info/problems/grep/) | [176](http://rosalind.info/problems/grep/recent/) | | | | |
| OAP | [Overlap Alignment](http://rosalind.info/problems/oap/) | [155](http://rosalind.info/problems/oap/recent/) | | | | |
| QRTD | [Quartet Distance](http://rosalind.info/problems/qrtd/) | [73](http://rosalind.info/problems/qrtd/recent/) | | | | |
| SIMS | [Finding a Motif with Modifications](http://rosalind.info/problems/sims/) | [191](http://rosalind.info/problems/sims/recent/) | | | | |
| SMGB | [Semiglobal Alignment](http://rosalind.info/problems/smgb/) | [159](http://rosalind.info/problems/smgb/recent/) | | | | |
| KSIM | [Finding All Similar Motifs](http://rosalind.info/problems/ksim/) | [68](http://rosalind.info/problems/ksim/recent/) | | | | |
| LAFF | [Local Alignment with Affine Gap Penalty](http://rosalind.info/problems/laff/) | [162](http://rosalind.info/problems/laff/recent/) | | | | |
| OSYM | [Isolating Symbols in Alignments](http://rosalind.info/problems/osym/) | [114](http://rosalind.info/problems/osym/recent/) | | | | |
| RSUB | [Identifying Reversing Substitutions](http://rosalind.info/problems/rsub/) |
~~