生物信息python 入门题 初级至进阶

本文水平有点low,本来不打算写这么低水平的文章,怎奈我的python水平就在这么low的水平。以此贴记录初试python解决生信简单问题的学习过程,也算让自己看清楚自己python到底有多渣。
python平时自用的多,就是非工作场合的自我学习期间就可以简单写点,或者清洗规整一下数据。自我约束不到位,代码日渐奔放,已经越来越像perl的编码习惯了,$a $b $c。。。。。??

题目完整版来自:http://rosalind.info/problems/list-view/;。每个题可能有多种解法,不同解法用分别用## 1/2/3表示。如果你也同我一样刚用python处理生信数据的话,请务必先自己做一遍再看文中代码。

1. 计算序列中各碱基数目

test.txt文件:

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
# 初级解法
#!/usr/bin/env python3
nts = {c:0 for c in 'ATGC'}
with open('./test.txt','r') as f:
  for a in f:
    a = a.upper()
    for nt in a.rstrip():
      nts[nt] += 1
print (nts)

# 进阶解题
test = open('test.txt','r').readlines()
[len(re.findall(b,''.join(test))) for b in ['A','T','G','C']]

2. 将DNA序列转化为RNA序列

#初级1
import re
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    RnaSeq = re.sub('T','U',line.rstrip())
print(RnaSeq)
#进阶1
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    print(line.replace('T','U'))
#进阶2
with open('test.txt','r') as f:
    for line in f:
            print(re.sub('T','U',line.rstrip().upper()))

3. 获取序列的反向互补序列

#初级1
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('./test.txt','r') as f:
  for line in f:
    seq = ''
    line = line.upper()
    for aa in line.rstrip():
      seq += trans.get(aa)
     print(seq[::-1])
#初级2
def reverse_complement(seq):
  ntComplement = {'A':'T','T':'A','G':'C','C':'G'}
  RevSeqList = list(reversed(seq))
  RevComSeqList = [ntComplement[k] for k in RevSeqList]
  RevComSeq = ''.join(RevComSeqList)
  return RevComSeq

seq = ''
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
print (reverse_complement(line.rstrip()))
# 进阶
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('test.txt','r') as f:
    for line in f:
        tmp_list = list(reversed(line.strip()))
        tmp_list2 = [trans[a] for a in tmp_list]
        print(''.join(tmp_list2))

4. 找出fasta文件中GC含量最大的序列

#初级1
import re
Seq = {}
seqGC = {}
with open('./test.fa','r') as f:
        for line in f:
                if re.match(">",line):
                        SeqName = line[1:]
                        Seq[SeqName] = ''
                        seqGC[SeqName] = 0
                else:
                        line = line.upper()
                        line = line.rstrip()
                        Seq[SeqName] += line
                        seqGC[SeqName] += line.count('G')
                        seqGC[SeqName] += line.count('C')
maxGC = 0
for key , value in Seq.items():
        if maxGC < float(seqGC[key]/ len(value)*100):
                maxGC = float(seqGC[key] / len(value)*100)
                tmp = key
print ('>'+tmp+Seq[tmp])
# 进阶初版
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
GcContent = OrderedDict()
with open('./test.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        SeqName = line[1:]
                        SeqTest[SeqName] = ''
                        continue
                SeqTest[SeqName] += line.upper()

for key, value in SeqTest.items():
        totalLength = len(value)
        gcNum = value.count('G') + value.count('C')
        gcContent[key] = float(gcNum/totalLength)*100
sortedGC = sorted(gcContent.items(),key = itemgetter(1))
largeName = sortedGC[-1][0]
largeGCcontent = sortedGC[-1][1]
print ('most GC ratio rate is %s and it is %s ' %(largeName,largeGCContent))
# 进阶高板
seq = {}
seq_gc = {}
with open('test.fa','r') as f:
    for line in f:
        line = line.strip()
        if re.match('>',line):
            seq_id = re.sub('>', '',line)
            seq[seq_id] = ''
        else:
            seq[seq_id] = seq[seq_id] + line.upper()

for key,s in seq.items():
    seq_gc[key] = (s.count('G') + s.count('C'))/len(s)
    print(max(seq_gc,key=seq_gc.get) + '\t' + max(seq_gc.values()))

5. 计算点突变数目

给两个长度为t的序列s,t和s之间的哈明距离(Hamming distance)定义为dH(s,t)。该问题即返回两条序列的哈明距离。


image
## 初版
fh = open('./test.txt','r')
lst = []
for line in fh:
        lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
        if lst[0][i] == lst[1][i]:
                continue
        hamming_dis += 1
print (hamming_dis)
#进阶1
fh = open('./test.txt','r')
seq = file.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print (len(mutation))
#进阶2 
import numpy as np
import pandas as pd
sequence = []
number = 0
with open('test.txt','r') as f:
    for line in f:
        sequence.append(list(line.strip().upper()))
frame = pd.DataFrame(np.asarray(sequence))
for i in range(len(frame.columns)):
    if frame[i][0] != frame[i][1]:
        number = number + 1
print(number)

6. 孟德尔第一定理

一个群体中有三种基因型的生物:k,显性纯合子;m,杂合子;n,隐性纯合子。假设这对形状由一对等位基因控制,且群体中随机选取的任何两个个体都能交配,求随机选取两个个体交配后,子代拥有显性等位基因的概率。

#初版
k = int(input("enter the number of homozygous dominant: "))
m = int(input("enter the number of heterozygous: "))
n = int(input("enter the number of homozygous recessive: "))

num = int(k + m + n)
choice = num*(num-1)/2.0
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)
#进阶1
from scipy.misc import comb
num = input("Number of individuals(k,m,n): ")
[k,m,n] = map(int,num.split(','))
t = k + m + n
rr = comb(n,2)/comb(t,2)
hh = comb(m,2)/comb(t,2)
hr = comb(n,1)*comb(m,1)/comb(t,2)

p = 1 - (rr+hh*1/4+hr*1/2)
print(p)

7. 将RNA翻译成蛋白质

def translate_rna(sequence):
    codonTable = {
    'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
    'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
    'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
    'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
    'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
    'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
    'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
    'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
    'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
    'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
    }
    proteinsequence = ''
    for n in range(0,len(sequence),3):
        if sequence[n:n+3] in codonTable.keys():
            proteinsequence += codonTable[sequence[n:n+3]]
    return proteinsequence


protein_fh = open('./protein.txt','w')
with open('./rna.txt','r') as f:
        for line in f:
                protein_fh.write(translate_rna(line.strip('\n')))
## 进阶
import re
from collections import OrderedDict

codonTable = OrderedDict()
codonTable={
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}

rnaseq = ''
with open('./rna.txt','r') as f:
        for line in f:
                line = line.rstrip()
                line += line.upper()

aminoAcids = []
i = 0
while i < len(rnaseq):
        condon = rnaseq[i:i+3]
        if codonTable[condon] != '':
                aminoAcids.append(codonTable[condon])
        i += 3

peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna,generic_rna

# translate
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
print(rna.translate())

8. 寻找DNA motif

## 1 
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
for i in range(len(seq)-motif_len):
        if seq[i:i+motif_len] == motif:
                position.append(i+1)
print(position)
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])

9. 多个等长序列的一致性序列

比如序列如下:

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

各位点碱基个数:

A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
Consensus   A T G C A A C T
#1 初阶
def seq_list(fasta):
        seq_list = []
        for line in fasta.readlines():
                if not line.startswith('>'):
                        seq = list(line.rstrip())
                        seq_list.append(seq)
        return seq_list
def statistic_base(seq_list):
        for base in 'ATGC':
                base_total = []
                for sit in range(len(seq_list[0])):
                        col = [x[sit] for x in seq_list]
                        num = col.count(base)
                        base_total.append(num)
                print('%s:%s'%(base,base_total))
fh =  open('./test.fa','r')
sequence_list = seq_list(fh)
statistic_base(sequence_list)
# 进阶 初级
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt','wt')

with open('./test.fa','r') as f:
        for line in f:
                if line.startswith('>'):
                        seq_name = line.rstrip()
                        seq[seq_name] = ''
                        continue
                seq[seq_name] += line.upper().rstrip()
        seqLength = len(seq[seq_name])

a,t,g,c = [],[],[],[]
consensus = ''
for i in range(seqLength):
        sequence = ''
        for j in seq.keys():
                sequence += seq[j][i]
        a.append(sequence.count('A'))
        t.append(sequence.count('T'))
        g.append(sequence.count('G'))
        c.append(sequence.count('C'))
        counts = Counter(sequence)
        consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()

# 进阶高级
seq = []
seq_dict = {}
with open('t.list','r') as f:
    for line in f:
        if not line.startswith('>'):
            seq.append(list(line.strip().upper()))          
for aa in ['A','T','G','C']:
    seq_dict[aa] = [list(frame[i]).count(aa) for i in range(len(frame.columns))]
     print(''.join(list(pd.DataFrame(seq).idxmax(axis=1))))

10. 致命的斐波那契兔子

斐波那契序列是一个序列的数字定义的递归关系Fn = Fn-1+ Fn−2 ,我们设置的起始值F1 = F2 = 1。
假设每只兔子可以活m个月,n个月后有多少只兔子?

## 1
def fib(n,m):
        f= [0,1,1]
        for i in range(3,n+1):
                if i <= m:
                        total = f[i-1] + f[i-2]
                elif i == m+1:
                        total = f[i-1] + f[i-2] - 1
                else:
                        total = f[i-1] + f[i-2] - f[i-m-1]
                f.append(total)
        return(f[n])

inp = input('live month of rabbit(m),and afther n-th month;n<=100,m<=20;input(n,m): ')
[n,m]=map(int,inp.split(','))

print(fib(n,m))

11. Graph Theory

文件介绍好麻烦,自己看:http://rosalind.info/problems/grph/
总之该题有三个碱基的首尾相同就连接起来,
输入文件:

>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG

输出结果:

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
seq = {}
with open('./overlap.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        seqname = line[1:]
                        seq[seqname] = ''
                        continue
                seq[seqname] += line.upper()

for key , value in seq.items():
        for key2 ,value2 in seq.items():
                if key != key2 and value[-3:] == value2[:3]:
                        print(key+'\t'+key2)

12. 计算后代的期望值

同样懒得解释原理,具体原理看:http://rosalind.info/problems/iev/
现在有6种基因型组合夫妇:

AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa

给定6个非负整数,代表6种基因型组合的夫妇数量,求下一代显性性状的个数,假设每对夫妻有2个孩子。

def expected(a,b,c,d,f,g):
        AA_AA = 1
        AA_Aa = 1
        AA_aa = 1
        Aa_Aa = 0.75
        Aa_aa = 0.5
        aa_aa = 0
        p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
        return (p)

inp = input('input(a,b,c,d,f,g): ')
[a,b,c,d,f,g] = map(int,inp.split(','))
print(expected(a,b,c,d,f,g))

13. 计算序列间的最长一致性序列,即寻找序列间公有的motif(Finding a Shared Motif)

# 测试数据
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

# 输出结果:
AC
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,463评论 6 497
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,868评论 3 391
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,213评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,666评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,759评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,725评论 1 294
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,716评论 3 415
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,484评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,928评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,233评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,393评论 1 345
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,073评论 5 340
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,718评论 3 324
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,308评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,538评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,338评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,260评论 2 352

推荐阅读更多精彩内容