一:数据下载
从http://plants.ensembl.org/index.html分别下载这两个物种的GFF文件和CDS序列.
# Athaliana
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/fasta/arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.cds.all.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gff3.gz
# Alyrata
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/fasta/arabidopsis_lyrata/cds/Arabidopsis_lyrata.v.1.0.cds.all.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-44/gff3/arabidopsis_lyrata/Arabidopsis_lyrata.v.1.0.44.gff3.gz
二:数据预处理
将GFF3转成BED格式
python -m jcvi.formats.gff bed --type=mRNA --key=transcript_id Arabidopsis_thaliana.TAIR10.44.gff3.gz > ath.bed
python -m jcvi.formats.gff bed --type=mRNA --key=transcript_id Arabidopsis_lyrata.v.1.0.44.gff3.gz > aly.bed
然后将bed进行去重复
python -m jcvi.formats.bed uniq ath.bed
python -m jcvi.formats.bed uniq aly.bed
最后我们得到了ath.uniq.bed和osa.uniq.bed, 根据bed文件第4列就可以用于提取cds序列和蛋白序列。
seqkit grep -f <(cut -f 4 aly.bed ) Arabidopsis_lyrata.v.1.0.cds.all.fa.gz | seqkit seq -i > aly.cds
seqkit grep -f <(cut -f 4 ath.bed ) Arabidopsis_thaliana.TAIR10.cds.all.fa.gz | seqkit seq -i > ath.cds
blast比对
makeblastdb -in aly.cds -out db/aly -dbtype nucl
blastn -num_threads 20 -query ath.cds -db db/aly -outfmt 6 -evalue 1e-5 -num_alignments 5 > ath_aly.blast
用jcvi.compara.blastfilter对结果进行过滤
python -m jcvi.compara.blastfilter --no_strip_names ath_aly.blast --sbed aly.bed --qbed ath.bed
最后输出aly_ath.blast.filtered用于做图
python -m jcvi.graphics.blastplot ath_aly.blast.filtered --sbed aly.bed --qbed ath.bed