匡さん消えてのせかいえ

First, extract the longest transcript from GFF3 annotation files.

Here, we use AGAT by singularity container:

singularity exec agat_1.4.2--pl5321hdfd78af_0.sif agat_sp_keep_longest_isoform.pl -gff xxx.gff  -o longest.gff

Then, we used itools to extract the protein and coding sequence:

./iTools_Code/iTools Gfftools  getCdsPep -Ref  $ref.fa  -Gff longest.gff3  -OutPut $ref 
# $ref is the spceis name of the genome

And from this step, we can two files : $ref.cds.fa.gz and $ref.pep.fa.gz. We gzip -d the $ref.pep.fa file and modify the name of each protein. For example, the original format of the $ref.pep.fa could be like:

>evm.modle.Chr.1.1
SDLKDFSLJFSALDFSLADKFJLSDKFSADFJ ASDFJ ASLJFASLDKFJSLADKFJWERKLQWBEKQW
>evm.modle.Chr.1.2
ALDFSLADKFJLSDKFSADFJ ASDFJSLJFASLDKFJSLADKFJWERKLQWBEKQWSADFASDFSDFSD

We will add the specific species name of it by (For example, the species name is Mus musculus)

sed -i 's/>/>musculus\|/g'  $ref.pep.fa

And the protein file would be like this format after modifciation:

>musculus|evm.modle.Chr.1.1
SDLKDFSLJFSALDFSLADKFJLSDKFSADFJ ASDFJ ASLJFASLDKFJSLADKFJWERKLQWBEKQW

This is what format we need for download analysis.

So, we do above step on all species you want to include in this analysis, and then we will get all modified protein files. Here, we create a folder named "data", and put all protein files into it (excluding CDS fasta).

Then, we do orthofinder do get the orthogroups,

/path/to/orthofinder  -f  data -t 50 
# data is the name of the folder containing FASTA format protein files

Waiting for the results. We will used the results of the Orthofinder to do the original Agora analysis.

We need three type of input files for agora: 1. geneList 2. orthoGroup 3. speciestreewith labeled node.

The speciestree would be the easier to get, we can extract from the results of orthofinder:

cp ./data/OrthoFinder/*/Species_Tree/SpeciesTree_rooted_node_labels.txt 
sed -i 's/\.pep//g'  OrthoFinder/Results_Apr09/Species_Tree/SpeciesTree_rooted_node_labels.txt

The genelist would be harder to get if your GFF3 is not from EVM output. let's see the format of genelist first:

scaffold1       65783   65953   -1      musculus|evm.model.scaffold1.1
scaffold1       173368  173538  -1      musculus|evm.model.scaffold1.2
scaffold1       225013  225418  -1      musculus|evm.model.scaffold1.3
scaffold1       225775  226033  -1      musculus|evm.model.scaffold1.4
scaffold1       343420  343714  -1      musculus|evm.model.scaffold1.5

where 1 and -1 represent the strand.
If your gff3 is produced from EVMmoderler, you could use this PERL script EVMgff2Agora.pl to transfer: Usage: perl EVMgff2Agora.pl gff > Genelist/genes.$species.list

#! /usr/bin/perl
use strict;
use warnings;

my $input = shift;
open I, "<$input" or die "Usage: perl $0 gff > output";

while(<I>)
{
        chomp;
        if(/mRNA/)
        {
                my @arr = split(/\t/);
                $arr[8] =~ m/ID=(.*);Parent/;
                if ($arr[6] eq "+" )
                        {
                                print "$arr[0]\t$arr[3]\t$arr[4]\t1\t$1\n";
                        }
                else
                        {
                                print "$arr[0]\t$arr[3]\t$arr[4]\t-1\t$1\n";
                        }
        }
}

last we will get the orthogourp list from OrthoFinder/*/Phylogenetic_Hierarchical_Orthogroups/N*.tsv
We can use the convertTsv2agora.pl to do it. For N7 node, we can do perl convertTsv2agora.pl N0.tsv > orthologyGroups.N0.list.

Here is the detailed script of convertTsv2agora.pl

#! /usr/bin/perl
use strict;
use warnings;
binmode(STDIN, ":unix");
binmode(STDOUT, ":unix");
my $input = shift;
open I, "<$input" or die "";

my $line_number = 0;

while (my $line = <I>) {
        $line_number++;
        next if $line_number == 1;
        chomp $line;
        next if  $line =~ /^\s*$/;
        $line =~ s/,/ /g;
        my @columns = split /[\t ]+/, $line;
        next if @columns < 4;
        my @rest = @columns[3 .. $#columns];    
        my $clean_line = join(' ', @rest);
        $clean_line =~ s/\s+/ /g;
        $clean_line =~ s/^\s+//; 
        $clean_line =~ s/\s+$//;
        next if $clean_line =~ /^\s*$/;
        print  "$clean_line\n";
}
close I;

So until here we have prepared all files for agora. We put genes.%s.list in Genelist folder, orthologyGroups.%s.list in orthologyGroups folder and run (We are doing reserach on animals so we used agora-vertebrates.py ).

./Agora/src/agora-vertebrates.py SpeciesTree_rooted_node_labels.txt orthologyGroup/orthologyGroups.%s.list  Genelist/genes.%s.list  -workingDir=Agora.result

The results will be in Agora.result/ancGenomes/vertebrates-workflow/*.list.bz2. We will put the results of agora in ANGES to get the final results of ancestral karyotype restruction. First we will get the gene information from geneList and generate a BED format file. Using perl ConvertAgora2Anges.pl genes.<species>.list

#!/usr/bin/perl
use strict;
use warnings;
my $input = $ARGV[0] or die "Usage: $0 genes.<species>.list\n";
$input =~ /genes\.(\w+)\.list/ or die "Input file name must match: genes.<species>.list\n";
my $species = $1;
my $output = "$species.bed";
open(IN, $input) or die "Cannot open $input: $!\n";
open(OUT, ">", $output) or die "Cannot write to $output: $!\n";
while (<IN>) {
            chomp;
        my ($scaffold, $start, $end, $strand, $gene) = split(/\t/);
    my $symbol = $strand == 1 ? "+" : "-";
       print OUT join("\t", "$species\.$scaffold", $start, $end, $gene, 0, $symbol, $species), "\n";
}
close IN;

And we will get BED format annotaion for each species like:

cahirinus.Chr4  11944   12403   cahirinus|evm.model.Chr4.1      0       +       cahirinus
cahirinus.Chr4  13737   14065   cahirinus|evm.model.Chr4.2      0       +       cahirinus
cahirinus.Chr4  35907   79905   cahirinus|evm.model.Chr4.3      0       +       cahirinus

And will cat all files into one file:

cat *.bed > all.bed

Then we will get marker genes from ancGenome.N0.list.bz2:

bzcat  ancGenome.N0.list.bz2 >  ancGenome.N0.list
perl Process_agora.pl  ancGenome.N0.list Agora.out
perl GetMarker.pl Agora.out  all.bed  Input.marker 
##Noice : only contained genes from Chromosome
##but should not contained genes from Contig unanchored!!!!

Here is the script of Process_agora.pl and GetMarker.pl

Process_agora.pl

#!/usr/bin/perl
use strict;
use warnings;
my $input =  shift;
my $output = shift;
open(IN, "<", $input) or die "Cannot open $input: $!";
open(OUT, ">", $output) or die "Cannot open $output: $!";
while (<IN>) {
        next if /^singleton_/;  # 跳过 singleton 行
                chomp;
                my @fields = split(/\s+/, $_);
                next if @fields < 6;  # 少于6列跳过
                my @genes;
                for my $i (5 .. $#fields) {
                 my $gene = $fields[$i];
                  # 保留如 cahirinus|evm.model.X
                $gene =~ s/^([^.]+)\.[^|]+\|/$1|/;
                   push @genes, $gene;
                  }
                   print OUT join(" ", @genes), "\n";
  }      
close(IN);
close(OUT);

GetMarker.pl

use strict;
use warnings;

my $block_file = shift;
my $coord_file = shift;
my $output_file = shift;

my %coords;
open(COORD, "<", $coord_file) or die "Cannot open $coord_file: $!";
while (<COORD>) {
            chomp;
                my @cols = split(/\t/);
                    my ($chr, $start, $end, $gene_id, $strand, $species) = @cols[0,1,2,3,5,6];
                        $coords{$gene_id} = "$chr:$start-$end $strand";
                }
                close(COORD);


open(BLOCK, "<", $block_file) or die "Cannot open $block_file: $!";
open(OUT, ">", $output_file) or die "Cannot write to $output_file: $!";

my $block_num = 1;
while (<BLOCK>) {
            chomp;
            next if $_ =~ /^\s*$/;  # 跳过空行
           print OUT ">$block_num\n";
            my @genes = split(/\s+/, $_);  # 空格分隔
            foreach my $gene (@genes) {
                 if (exists $coords{$gene}) {
                  print OUT "$coords{$gene}\n";
                    } else {   warn "Warning: coordinate not found for $gene\n";
                 }
                       }
    print OUT "\n"; 
 # block 之间空行
                                                      $block_num++;
     }
close(BLOCK);

We look at the input.marker, but find:

>1
musculus.Chr3:70829216-70859612 -
rat.Chr3:78010827-78041764 +
human.Chr17:9108469-9137955 +

That's not good, we should modify and get :

musculus.3:70829216-70859612 -
rat.3:78010827-78041764 +
human.17:9108469-9137955 +

Label the tree by @ in your species tree:

((Homo_sapiens:0.01562,Pan_troglodytes:0.01562)@:0.0198250,Pongo_pygmaeus:0.03544)

Finally, modify the configure file exmPara

# Parameters File
# Project: BOREOEUTHERIAN
# Date: 2012-05-03
#
## Input files
#
markers_file Input.marker # file name for markers
tree_file SpeciesTree_rooted_node_labels.txt # file name for species tree
#
## Output options
#
output_directory N1 # directory name
output_ancestor acomys  # ancestor name
#
## Markers options
#
markers_doubled 0 # 0 = original markers, 1 = doubled markers
markers_unique 2 # 0 = repeated markers allowed, 1 = ingroup unique, 2 = unique
markers_universal 0 # 0 = missing markers allowed, 1 = ingroup universal, 2 = universal
#
## ACS options
#
#acs_pairs  # name of file containing the species pairs to compare
acs_sa 1 # supported adjacencies: 0 = not computed, 1 = computed
acs_ra 0 # reliable adjacencies: 0 = not computed, 1 = computed
acs_sci 1 # strong common intervals: 0 = not computed, 1 = computed
acs_mci 1 # maximal common intervals: 0 = not computed, 1 = computed
acs_aci 0 # all common intervals: 0 = not computed, 1 = computed
#acs_file  # ACS provided by user
acs_weight 1 # weighting ACS: 1 = linear interpolation
acs_correction 1 # Correcting for missing markers: 0 = none, 1 = adding markers spanned by intervals, 2 = X, 3 = both
#
## C1P model
#
c1p_linear 0 # 1 f or working with linear chromosomes
c1p_circular 0 # 1 for working with a unique circular chromosomes
#
## Telomeres (model+optimization)
#
c1p_telomeres 0 # 0: no telomere, 1: added after optimization (greedy heuristic), 2: added after optimization (bab), 3: added during optimization (bab)
#
c1p_heuristic 0 # Using a greedy heuristic
c1p_bab 0 # Using a branch-and-bound
c1p_spectral 1 # Using spectral seriation
#
## Spectral seriation options
#
c1p_spectral_alpha 1.0 # Spectral seriation alpha value
#
# END

Finally, run

python anges_CAR.py exmPara

Will output the karyotype results!

匡さん消えてのせかいえ

推荐阅读更多精彩内容