1.准备文件
1)需要蛋白文件和编码蛋白的核酸序列,文件格式fasta格式,对应的蛋白和核酸的id要一样
==> all.cds1 <==
>Zm_accD
ATGTTCAAATATCAATGCAATTTTTTATTTTCCAACGAGAAGGTAGAACA
TCATAGATGTGGGCCGAGTAAATCAATAGATAGTGTTGATAGTATTGGAC
ATACAGATAGAAGTGAACAACCTATTCTAAACGATATGGAGAAAAAGGTT
CCTAGTTGGAATCGTATTAGTAATTATAGTTTCAATAATGTTGATTATTT
ATTTGATATCAGGAATATTTGGAGTTTGATCTCTGATGACACTTTTTTTG
TTAGGGATAGTAATGGTGATCGTTACTCTATATTTTTTGATATTGAAAAT
CATATTTTTGAGGTTGACAATGATAGTTCCTTTATGAGTGAACTAGAAAT
TATTGTTTCTAGTTATTTGAATAGGGGGTCTAAGTCTAAGAAAAAGAATC
ATACATGTAATGATACTGAATCCAGTTGGAAAAAGAACATTATTAGTAGC
==> all.pep1 <==
>Zm_accD
MFKYQCNFLFSNEKVEHHRCGPSKSIDSVDSIGHTDRSEQPILNDMEKKV
PSWNRISNYSFNNVDYLFDIRNIWSLISDDTFFVRDSNGDRYSIFFDIEN
HIFEVDNDSSFMSELEIIVSSYLNRGSKSKKKNHTCNDTESSWKKNIISS
IDSYLRFEVSINSSISSSTNESYIYNFICTENKNSSESDRSSIRTSQNID
DLDIRVEESNHNDNPFYKFRHLWVQCENCYGAHYKQFFGEKMYICEFCGY
HLKMSSSDRIELSIDPGTWDPMDEYMVSVDPIEFDSPVEFDEEYEDEEPD
SDRDHIDFSRDQPEDDDSYIDRIDSYQRETGLNEAVQTGIGQLNGIPVAF
GVMDFQFMGGSMGSVVGEKITRLIEYATNRSLPIIIVCASGGARMQEGSL
SLMQMAKISSVLYNYQLNKKLFYVAILTDPTTGGVTASFAMLGDIIIAEP
注意序列的id要完全一致,后面不要跟空格之类的字符
2)同源蛋白分组信息,一个分组的蛋白会进行比对,计算值
for i in `fastalength all.pep | cut -f 2 -d '_' | sort | uniq -c |awk '$1==3' | grep -v '\-D' | awk '{print $2}'`;do fastalength all.pep | grep -v '\-D' | grep "$i\>" | awk '{printf"%s\t",$2}END{printf"\n"}'>> all.homo; done #生成分组信息表
###
Kg_accD Az_accD
Kg_atpA Az_atpA
Kg_atpB Az_atpB
Kg_atpE Az_atpE
Kg_atpF Az_atpF
Kg_atpH Az_atpH
Kg_atpI Az_atpI
###
3.ParaAT对齐并计算Ka Ks值
ParaAT会先对齐氨基酸序列,然后根据氨基酸序列对齐核酸序列。并且有参数可以直接计算Ka Ks值。
perl /home/chenyw/bin/ParaAT.pl -h all.homo -n all.cds1 -a all.pep1 -p proc -o output -f axt -c 11 -k
ParaAT计算Ka Ks是用KaKs_Calculator计算的,采用的是默认参数,及-c 1 -m MA,所以如果要选用其它密码子表或者模型就要自己手动计算,接下来是手动计算的过程。
4.用KaKs_Calculator计算Ka Ks
/home/chenyw/bin/KaKs_Calculator -i Kg_psbN-Zs_psbN.cds_aln.axt -o Kg_psbN-Zs_psbN.cds_aln.axt.kaks -c 11 -m MS
补充:
ncbi的gbk文件存在不规范,可以按CDS直接提取,这里使用biopython接口。代码如下:
import argparse
parser = argparse.ArgumentParser(description='Search some files')
parser.add_argument(dest='filenames', metavar='filename')
args = parser.parse_args()
gbk_file=args.filenames
from Bio import SeqIO
for x in SeqIO.parse(gbk_file, "genbank"):
for x1 in x.features:
if x1.type=="CDS":
id='_'.join(x1.qualifiers['product'][0].split())
nc_seq=x1.extract(x.seq)
print('>'+id+'\n'+nc_seq)