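The Ensembl database
Ensembl is developed and maintained at the European Bioinformatics Institute (EMBL-EBI). Besides genome downloads, it offers many comparative genomics resources. For example, orthologs and paralogs between species can be downloaded directly from BioMart; for hexaploid wheat in particular, BioMart also provides the homoeologous genes across the A, B and D subgenomes. The Ensembl API can additionally be used to retrieve, in bulk, the genome annotation, the alignments between genomes, and the variation data stored in the back-end databases.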
Using the Ensembl Compara API
References: https://plants.ensembl.org/info/docs/api/compara/index.html
https://plants.ensembl.org/info/docs/api/compara/compara_tutorial.html
https://plants.ensembl.org/info/docs/api/general_instructions.html
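The Ensembl Compara database stores multi-species genome comparisons: whole-genome alignments and synteny at the DNA sequence level, and gene trees and homology predictions at the gene level.

Data are retrieved from the database through the Compara API. For vertebrate genome alignments the official tutorial can be followed as-is; for non-vertebrates the connection settings have to be adjusted by hand.

The first step is to connect to the database, which is done through the Registry module. (Note that vertebrates and other divisions, such as plants, are served from different hosts.)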
# Connect to the vertebrate Ensembl database
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org', # alternatively 'useastdb.ensembl.org'
-user => 'anonymous'
);
# For other species, such as plants, change the host (and port)
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'mysql-eg-publicsql.ebi.ac.uk',
-port => 4157
);
# Connect to both databases at once
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_multiple_dbs(
{-host => 'mysql-eg-publicsql.ebi.ac.uk',
-port => 4157,
-user => 'anonymous'
},
{-host => 'ensembldb.ensembl.org',
-port => 5306,
-user => 'anonymous'
}
);
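Since the three snippets above differ only in their connection parameters, those parameters can be kept in one lookup table. The following is a minimal sketch, assuming the hosts and ports quoted in this post are still current; the names %ENSEMBL_SERVERS and registry_args, and the division keys 'vertebrates' and 'plants', are made up for illustration and are not part of the Ensembl API.

```perl
use strict;
use warnings;

# Connection settings quoted in this post. The hash name and the
# division keys are illustrative, not something the Ensembl API defines.
my %ENSEMBL_SERVERS = (
    vertebrates => {
        '-host' => 'ensembldb.ensembl.org',
        '-port' => 5306,
        '-user' => 'anonymous',
    },
    plants => {
        '-host' => 'mysql-eg-publicsql.ebi.ac.uk',
        '-port' => 4157,
        '-user' => 'anonymous',
    },
);

# Return the named arguments for load_registry_from_db for one division.
sub registry_args {
    my ($division) = @_;
    my $settings = $ENSEMBL_SERVERS{ lc $division }
        or die "Unknown division '$division'\n";
    return %{$settings};
}

# Print the host and port that would be used for plants.
my %args = registry_args('plants');
print $args{'-host'}, ":", $args{'-port'}, "\n";
```

With the Ensembl API installed, the returned list could be passed straight through, for example $registry->load_registry_from_db( registry_args('plants') ), so that switching division means changing one string.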
Exporting alignments from the whole-genome alignments

Export the human-mouse whole-genome alignment, reporting the results as genomic coordinates.
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
# Define the query species and the coordinates of the Slice
my $query_species = 'human';
my $seq_region = '14';
my $seq_region_start = 75000000;
my $seq_region_end = 75010000;
# Get the SliceAdaptor and fetch a slice
my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' );
my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end );
# Get the GenomeDB adaptor
my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' );
# Fetch GenomeDB objects for human and mouse:
my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens');
my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus');
# Get the MethodLinkSpeciesSetAdaptor
my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet');
# Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences
my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] );
# Get the GenomicAlignBlockAdaptor
my $genomic_align_block_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomicAlignBlock' );
# Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between human and mouse
my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $human_mouse_lastz_net_mlss, $query_slice ) };
# We will then (usually) need to restrict the blocks to the required positions in the reference sequence
foreach my $genomic_align_block( @genomic_align_blocks ) {
my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
# fetch the GenomicAligns and move through
my @genomic_aligns = @ { $restricted_gab->get_all_GenomicAligns };
foreach my $genomic_align (@genomic_aligns) {
my $species = $genomic_align->genome_db->get_scientific_name;
my $slice = $genomic_align->get_Slice;
print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t";
}
print "\n";
}
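The script above prints one line per alignment block, as tab-separated species and seq_region:start-end pairs. When those lines feed a later analysis step, it can help to turn them back into structured records. The sketch below is one way to do that in plain Perl; the function name parse_alignment_line is hypothetical, and the sample line uses made-up coordinates rather than real output.

```perl
use strict;
use warnings;

# Split one line printed by the script above into records.
# Each record is a hash with species, seq_region, start and end.
sub parse_alignment_line {
    my ($line) = @_;
    chomp $line;

    # The script ends every entry with a tab, so drop empty fields.
    my @fields = grep { length } split /\t/, $line;
    die "Expected species/location pairs in '$line'\n" if @fields % 2;

    my @records;
    while ( my ( $species, $location ) = splice @fields, 0, 2 ) {
        my ( $seq_region, $start, $end ) = $location =~ /^(.+):(\d+)-(\d+)$/
            or die "Cannot parse location '$location'\n";
        push @records, {
            species    => $species,
            seq_region => $seq_region,
            start      => $start,
            end        => $end,
        };
    }
    return @records;
}

# Illustrative input only: these coordinates are invented for the example.
my $sample = "Homo sapiens\t14:100-200\tMus musculus\t12:300-450\t\n";
foreach my $record ( parse_alignment_line($sample) ) {
    printf "%s\t%s\t%d\t%d\n",
        @{$record}{qw(species seq_region start end)};
}
```

Keeping the parser separate from the API calls means the alignment export can be run once and its output re-analysed offline.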
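For species outside the vertebrates, such as plants, the host and the species names must be changed. Taking Arabidopsis and wheat as an example, the required changes are:
1) set host to mysql-eg-publicsql.ebi.ac.uk and port to 4157
2) in the Compara get_adaptor calls, replace Multi with Plants
3) use arabidopsis_thaliana as the species name for Arabidopsis and triticum_aestivum for wheat
# Use the Ensembl API to retrieve the positions on Arabidopsis chromosome 1 that align to wheat sequences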
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'mysql-eg-publicsql.ebi.ac.uk',
-port => 4157
);
# Define the query species and the coordinates of the Slice
my $query_species = 'arabidopsis_thaliana';
my $seq_region = '1';
my $seq_region_start = 1;
my $seq_region_end = 30427671;
# Get the SliceAdaptor and fetch a slice
my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' );
my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end );
# Get the GenomeDB adaptor
my $genome_db_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'GenomeDB' );
# Fetch GenomeDB objects for tair and wheat:
my $tair10_genome_db = $genome_db_adaptor->fetch_by_name_assembly('arabidopsis_thaliana');
my $IWGSC_genome_db = $genome_db_adaptor->fetch_by_name_assembly('triticum_aestivum');
# Get the MethodLinkSpeciesSetAdaptor
my $method_link_species_set_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'MethodLinkSpeciesSet');
# Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between tair and wheat genomic sequences
my $tair10_IWGSC_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$tair10_genome_db, $IWGSC_genome_db] );
# Get the GenomicAlignBlockAdaptor
my $genomic_align_block_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'GenomicAlignBlock' );
# Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between tair and wheat
my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $tair10_IWGSC_lastz_net_mlss, $query_slice ) };
# We will then (usually) need to restrict the blocks to the required positions in the reference sequence
foreach my $genomic_align_block( @genomic_align_blocks ) {
my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
# fetch the GenomicAligns and move through
my @genomic_aligns = @ { $restricted_gab->get_all_GenomicAligns };
foreach my $genomic_align (@genomic_aligns) {
my $species = $genomic_align->genome_db->get_scientific_name;
my $slice = $genomic_align->get_Slice;
print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t";
}
print "\n";
}
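The code above produces the following output (one line per alignment block, listing the species name and the seq_region:start-end coordinates of each aligned sequence):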
