Using the Ensembl Compara API

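The Ensembl database

Ensembl is maintained by the European Bioinformatics Institute (EMBL-EBI). Besides genome downloads, it offers many comparative genomics resources. For example, orthologs and paralogs between species can be downloaded directly from BioMart; for hexaploid wheat in particular, BioMart also provides the homoeologous genes across the A, B and D subgenomes. The Ensembl API can additionally be used to retrieve, in bulk, the genome annotation, genome-to-genome alignments and variation data stored in the backend databases.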

Using the Ensembl Compara API

References: https://plants.ensembl.org/info/docs/api/compara/index.html

https://plants.ensembl.org/info/docs/api/compara/compara_tutorial.html

https://plants.ensembl.org/info/docs/api/general_instructions.html

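The Ensembl Compara database stores multi-species genome comparisons: whole-genome alignments and synteny at the DNA sequence level, and gene trees and homology predictions at the gene level.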

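Data are retrieved from the database through the Compara API. For vertebrate genome alignments, the official tutorial can be followed as is; for non-vertebrates, the connection settings have to be adjusted by hand.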

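The first step is to connect to the database, which is done with the Registry module. (Note that vertebrates and other species, such as plants, are served from different hosts.)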

 # Connect to the Ensembl database for vertebrates
 use Bio::EnsEMBL::Registry;
 my $registry = 'Bio::EnsEMBL::Registry';
 $registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org', # alternatively 'useastdb.ensembl.org'
  -user => 'anonymous'
 );
 
 # For other species, such as plants, change the host and port
 use Bio::EnsEMBL::Registry;
 my $registry = 'Bio::EnsEMBL::Registry';
 $registry->load_registry_from_db(
  -host => 'mysql-eg-publicsql.ebi.ac.uk',
  -port => 4157,
  -user => 'anonymous'
 );
 
 # Connect to both database servers at once
 use Bio::EnsEMBL::Registry;
 my $registry = 'Bio::EnsEMBL::Registry';
 $registry->load_registry_from_multiple_dbs(
  {-host => 'mysql-eg-publicsql.ebi.ac.uk',
  -port => 4157, 
  -user => 'anonymous'
  },
  {-host => 'ensembldb.ensembl.org',
  -port => 5306,
  -user    => 'anonymous'
  }
 );

Exporting alignments from Whole Genome Alignments

Export the human-mouse whole-genome alignment for a region, reporting each aligned block as genomic coordinates.

 use strict;
 use warnings;
 use Bio::EnsEMBL::Registry;
 
 my $registry = 'Bio::EnsEMBL::Registry';
 
 $registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
 );
 
 # Define the query species and the coordinates of the Slice
 my $query_species = 'human';
 my $seq_region = '14';
 my $seq_region_start = 75000000;
 my $seq_region_end   = 75010000;
 
 # Get the SliceAdaptor and fetch a slice
 my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' );
 my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end );
 
 # Get the GenomeDB adaptor
 my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' );
 
 # Fetch GenomeDB objects for human and mouse:
 my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens');
 my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus');
 
 # Get the MethodLinkSpeciesSetAdaptor
 my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet');
 
 # Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences
 my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] );
 
 # Get the GenomicAlignBlockAdaptor
 my $genomic_align_block_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomicAlignBlock' );
 
 # Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between human and mouse
 my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $human_mouse_lastz_net_mlss, $query_slice ) };
 
 # Restrict each block to the required positions in the reference sequence, then print the aligned regions
 foreach my $genomic_align_block( @genomic_align_blocks ) {
  my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
  next unless defined $restricted_gab;   # defensive check: skip if nothing is left after restriction
  # fetch the GenomicAligns and move through
  my @genomic_aligns = @{ $restricted_gab->get_all_GenomicAligns };
  foreach my $genomic_align (@genomic_aligns) {
   my $species = $genomic_align->genome_db->get_scientific_name;
   my $slice = $genomic_align->get_Slice;
   print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t";
  }
  print "\n";
 }

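For species outside the vertebrates, such as plants, the host and the species names need to be changed. Arabidopsis and wheat are used as the example here.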

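The following need to be changed:

1) Change the host to mysql-eg-publicsql.ebi.ac.uk and the port to 4157

2) In get_adaptor, replace Multi with Plants

3) The species name for Arabidopsis is arabidopsis_thaliana, and for wheat it is triticum_aestivum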
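
The differences listed above can be kept in one place so that a script only has to name the division it wants. The following is a minimal sketch in plain Perl; `%SETTINGS` and `settings_for` are hypothetical names introduced for illustration, and the values are simply the hosts, ports and Compara aliases used in this post.

```perl
use strict;
use warnings;

# Hypothetical lookup table: connection details per Ensembl division,
# using the values that appear in this post.
my %SETTINGS = (
    vertebrates => {
        host    => 'ensembldb.ensembl.org',
        port    => 5306,
        compara => 'Multi',
    },
    plants => {
        host    => 'mysql-eg-publicsql.ebi.ac.uk',
        port    => 4157,
        compara => 'Plants',
    },
);

# Return the settings hash for a division (case-insensitive),
# or die if the division is not in the table.
sub settings_for {
    my ($division) = @_;
    my $settings = $SETTINGS{ lc $division }
        or die "Unknown division '$division'\n";
    return $settings;
}

my $s = settings_for('plants');
print "host=$s->{host} port=$s->{port} compara=$s->{compara}\n";
```

The returned values would then be passed on as `-host => $s->{host}`, `-port => $s->{port}` in `load_registry_from_db`, and as the first argument of `get_adaptor( $s->{compara}, 'compara', ... )`.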

 # Use the Ensembl API to get the positions of sequences on Arabidopsis chromosome 1 that align to wheat
 use strict;
 use warnings;
 use Bio::EnsEMBL::Registry;
 
 my $registry = 'Bio::EnsEMBL::Registry';
 
 $registry->load_registry_from_db(
  -host => 'mysql-eg-publicsql.ebi.ac.uk',
  -port => 4157,
  -user => 'anonymous'
 );
 
 # Define the query species and the coordinates of the Slice
 my $query_species = 'arabidopsis_thaliana';
 my $seq_region = '1';
 my $seq_region_start = 1;
 my $seq_region_end   = 30427671;
 
 # Get the SliceAdaptor and fetch a slice
 my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' );
 my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end );
 
 # Get the GenomeDB adaptor
 my $genome_db_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'GenomeDB' );
 
 # Fetch GenomeDB objects for tair and wheat:
 my $tair10_genome_db = $genome_db_adaptor->fetch_by_name_assembly('arabidopsis_thaliana');
 my $IWGSC_genome_db = $genome_db_adaptor->fetch_by_name_assembly('triticum_aestivum');
 
 # Get the MethodLinkSpeciesSetAdaptor
 my $method_link_species_set_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'MethodLinkSpeciesSet');
 
 # Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between tair and wheat genomic sequences
 my $tair10_IWGSC_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$tair10_genome_db, $IWGSC_genome_db] );
 
 # Get the GenomicAlignBlockAdaptor
 my $genomic_align_block_adaptor = $registry->get_adaptor( 'Plants', 'compara', 'GenomicAlignBlock' );
 
 # Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between tair and wheat
 my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $tair10_IWGSC_lastz_net_mlss, $query_slice ) };
 
 # Restrict each block to the required positions in the reference sequence, then print the aligned regions
 foreach my $genomic_align_block( @genomic_align_blocks ) {
  my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
  next unless defined $restricted_gab;   # defensive check: skip if nothing is left after restriction
  # fetch the GenomicAligns and move through
  my @genomic_aligns = @{ $restricted_gab->get_all_GenomicAligns };
  foreach my $genomic_align (@genomic_aligns) {
   my $species = $genomic_align->genome_db->get_scientific_name;
   my $slice = $genomic_align->get_Slice;
   print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t";
  }
  print "\n";
 }

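The script prints one line per alignment block. Each line lists, for every aligned sequence in the block, the species name followed by its region in the form name:start-end, separated by tabs.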
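
Since the scripts above print tab-separated species and region pairs, the lines can be read back into structured records for further processing. The sketch below assumes exactly that layout (species, tab, name:start-end, tab, repeated); `parse_alignment_line` is a hypothetical helper, not part of the Ensembl API, and the example line uses made-up coordinates rather than real database output.

```perl
use strict;
use warnings;

# Parse one printed line into a list of hash references, one per aligned sequence.
sub parse_alignment_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;                          # strip the line ending
    my @fields = grep { length } split /\t/, $line;  # ignore empty fields
    my @records;
    while ( my ( $species, $location ) = splice @fields, 0, 2 ) {
        # Skip a dangling species name or a location that is not "name:start-end"
        next unless defined $location && $location =~ /^(.+):(\d+)-(\d+)$/;
        push @records, { species => $species, region => $1, start => $2, end => $3 };
    }
    return @records;
}

# Made-up example line (illustrative coordinates only)
my $example = "Homo sapiens\t14:100-200\tMus musculus\t12:300-400\t\n";
for my $rec ( parse_alignment_line($example) ) {
    printf "%s %s %d %d\n", @{$rec}{qw(species region start end)};
}
```

The greedy `(.+)` before the last colon is a deliberate choice so that region names that themselves contain a colon are still captured whole.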