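I have been reading about Python syntax for quite a while but never actually used it. After today's class on multiple sequence alignment I suddenly wanted to try some real Python, and getting started turned out to be really quick.
Start by typing an arbitrary string that looks vaguely like a protein sequence: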
>>> proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
Indexing in Python
>>> 'MEafw' [1]
'E'
>>> 'MEafw' [0]
'M'
>>> 'MEafw' [-1]
'w'
>>> proseq [1]
'W'
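Note that Python counts from 0, so index 0 is the first character. A negative index counts backwards from the end of the string, so -1 is the last character.
Slicing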
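Protein positions are normally quoted starting from 1, so the zero-based indexing above is an easy place to slip by one. Here is a minimal sketch of a converter; `residue_at` is a hypothetical helper name I made up for illustration, not something built into Python.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

def residue_at(seq, position):
    """Return the residue at a 1-based position (hypothetical helper)."""
    if not 1 <= position <= len(seq):
        raise IndexError(f'position {position} is outside 1..{len(seq)}')
    # Python indices start at 0, so shift the 1-based position down by one.
    return seq[position - 1]

first = residue_at(proseq, 1)            # same character as proseq[0]
last = residue_at(proseq, len(proseq))   # same character as proseq[-1]
print(first, last)
```

Asking for position 0 or for a position past the end raises an IndexError instead of silently wrapping around the way a negative index would.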
>>> proseq [0:3]
'AWJ'
>>> proseq [3:]
'FBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
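Slicing works like indexing, except that it extracts a range. The ":" separates the start and end positions: the start is included and the end is not, so [0:3] returns the characters at indices 0, 1 and 2. If the end position is left out, the slice runs to the end of the string.
String operations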
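A few more slices on the same string, as a small sketch. The variable names `head`, `middle`, `tail` and `overshoot` are only for illustration.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

head = proseq[:3]      # start omitted: begin at index 0
tail = proseq[-3:]     # negative start: the last three characters
middle = proseq[3:-3]  # everything between head and tail

# Because the end index is excluded, the three pieces fit together
# with no overlap and no gap.
print(head + middle + tail == proseq)

# Unlike indexing, a slice that runs past the end does not raise an error;
# it simply stops at the last character.
overshoot = proseq[45:100]
print(head, tail, overshoot)
```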
>>> 'pro' * 2
'propro'
>>> 'pro' + 'pro'
'propro'
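Interestingly, the + and * operators in Python are not limited to numbers. They work on strings too: + joins two strings and * repeats a string.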
>>> len(proseq)
49
The len() function returns the length of a string.
>>> proseq.count('A')
7
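The .count() method counts how many times a character (or a longer substring) occurs in the string.
Why is len() written in front of its argument, while .count() is written after the string?
The difference is between a built-in function and a method. len() is a built-in function that applies broadly to many kinds of objects, whereas .count() is a method that belongs to particular data types, such as strings.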
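A small sketch of that difference: the same len() call works on a string, a list and a dictionary, while count() is looked up on the object itself (both str and list happen to provide one). The `residues` list is just an illustrative name.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
residues = list(proseq)          # the same characters as a list

# One built-in function, several data types.
print(len(proseq))               # number of characters in the string
print(len(residues))             # number of items in the list
print(len({'A': 7, 'K': 7}))     # number of keys in the dictionary

# Methods belong to the object in front of the dot.
print(proseq.count('A'))         # str.count with a single character
print(proseq.count('AJ'))        # str.count also accepts a longer substring
print(residues.count('A'))       # list.count compares whole items
```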
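The for loop
Basic syntax:
for <index variable> in <sequence>:
    <command 1>
    <command 2>
    ..........
    <command x>
You have to get this syntax right, otherwise nothing else will work.
<sequence> can be a string or any collection of objects, and <index variable> is the name of the variable that receives each element in turn. On the first pass it holds the first element, on the next pass the second, and so on. The loop body is marked by indentation (four spaces by convention), and the loop ends once the commands have run for the last element.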
>>> for amino_acid in 'ABCDEFGHIKLMNPQRSTVWY':
...     number = proseq.count(amino_acid)
...     print(amino_acid, number)
...
A 7
B 3
C 0
D 3
E 0
F 6
G 0
H 1
I 0
K 7
L 2
M 2
N 4
P 0
Q 2
R 0
S 2
T 0
V 0
W 2
Y 0
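Note that from Python 3 onwards print is a function and must be called as print(). Separating the arguments with commas prints several values at once.
Hands-on: which amino acid occurs most often in telomerase reverse transcriptase?
>>> telomerase = '''MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF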
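The loop above walks over a fixed alphabet and calls count() once per letter. Another way, sketched here, is to walk over the sequence itself and tally each character in a dictionary. Dictionaries are not covered in this post, so treat `counts` and the `.get()` idiom as an extra, not as the method used above.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

counts = {}
for residue in proseq:
    # .get() returns 0 the first time a letter is seen.
    counts[residue] = counts.get(residue, 0) + 1

# Print the tallies in alphabetical order.
for residue in sorted(counts):
    print(residue, counts[residue])
```

Because every character in the string gets counted, letters that were never in the alphabet string (such as J here) also show up in the result.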
... HKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD
... VTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT
... ASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED
... KVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT
... LGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF
... NMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL
... LAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP
... PSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR
... SHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL
... LDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN
... KLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS
... LQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS
... VLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV
... NPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL
... TLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS
... LPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF
... LLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'''
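(Note that the triple quotes ''' ''' are essential here. A string in ordinary single quotes ' ' cannot span several lines unless every line ends with a backslash, which is impractical for a sequence this long. The line breaks inside the triple quotes become part of the string as \n characters, as the output below shows.)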
>>> telomerase
'MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF\nHKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD\nVTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT\nASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED\nKVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT\nLGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF\nNMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL\nLAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP\nPSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR\nSHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL\nLDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN\nKLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS\nLQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS\nVLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV\nNPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL\nTLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS\nLPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF\nLLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'
>>>
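Those \n characters do not affect count() for any letter, but they do inflate len(). Here is a minimal sketch of stripping them, using only the first ten residues of the first two lines above as a short fragment; `fragment` and `cleaned` are illustrative names.

```python
# First ten residues of the first two lines of the sequence above.
fragment = '''MSITDLSPTL
HKELQQVPSC'''

print(len(fragment))           # the newline is counted as a character
print(fragment.count('\n'))    # one line break between the two lines

# split() with no argument breaks on any whitespace, including newlines;
# joining the pieces with an empty string leaves only the residues.
cleaned = ''.join(fragment.split())
print(cleaned, len(cleaned))
```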
>>> for amino in "ABCDEFGHIKLMNPQRSTVWY":
... number = telomerase.count(amino)
... print(amino , number)
...
A 56
B 0
C 32
D 46
E 46
F 59
G 64
H 28
I 47
K 73
L 146
M 24
N 31
P 44
Q 54
R 79
S 83
T 46
V 73
W 11
Y 32
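OK, now it is obvious that L (Leu, leucine) is the most frequent with 146 occurrences, and B is the rarest: there are none at all. But something suddenly looks off, so let's check with len():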
>>> len('ABCDEFGHIKLMNPQRSTVWY')
21
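And then it dawned on me: the string has 21 letters, but there are only 20 standard amino acids. B is not one of them (it is the ambiguity code for Asx, meaning aspartate or asparagine), so it should never have been in the list. Next time I get the one-letter amino acid codes wrong, I am standing in the corner to think about what I did!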
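To avoid retyping the alphabet by hand, here is a sketch that keeps the 20 standard one-letter codes in a single constant and lets the loop pick the winner. `STANDARD_AMINO_ACIDS` and `most_common_residue` are names I made up for illustration. It is demonstrated on the short proseq string, not the full telomerase sequence.

```python
STANDARD_AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'   # 20 letters, no B

def most_common_residue(seq):
    """Return (letter, count) for the most frequent standard residue.

    Hypothetical helper; on a tie, the letter that comes first in
    STANDARD_AMINO_ACIDS wins.
    """
    best_letter, best_count = None, -1
    for amino in STANDARD_AMINO_ACIDS:
        number = seq.count(amino)
        if number > best_count:
            best_letter, best_count = amino, number
    return best_letter, best_count

proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
print(len(STANDARD_AMINO_ACIDS))
print(most_common_residue(proseq))

# Letters in the sequence that are not standard one-letter codes.
print(sorted(set(proseq) - set(STANDARD_AMINO_ACIDS)))
```

The last line is a cheap sanity check: it lists any characters in a sequence that are not among the 20 standard codes, which for the made-up proseq string are B and J.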