Defining genomic regions
First let’s install the latest version of BEDTools:
git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
bin/bedtools --version
#bedtools v2.20.1-4-gb877b35
cd ..
Now download the GENCODE version 19 (which is currently the latest update):
v=19
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz
#what does this file look like?
zcat gencode.v19.annotation.gtf.gz | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
In the third column of the GTF file corresponds to the feature type. How many different types of features are there?
zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn
1196293 exon
723784 CDS
284573 UTR
196520 transcript
84144 start_codon
76196 stop_codon
57820 gene
114 Selenocysteine
So the exonic regions are already defined; however, I would like to remove overlapping exons. The mergeBed utility in BEDTools can accomplish this; however before merging make sure the coordinates are sorted. We can use sortBed to sort the BED file:
v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/mergeBed -i - | gzip > gencode_v${v}_exon_merged.bed.gz
To define intronic regions, we just need to subtract the exonic region from the genic region. The utility subtractBed can do this:
v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/subtractBed -a stdin -b gencode_v${v}_exon_merged.bed.gz |
gzip > gencode_v${v}_intron.bed.gz
#let's intersect the two files
#this shouldn't produce any output
intersectBed -a gencode_v${v}_exon_merged.bed.gz -b gencode_v${v}_intron.bed.gz
And finally to define intergenic regions, we use complementBed to find regions not covered by genes.
v=19
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/complementBed -i stdin -g hg19.genome |
gzip > gencode_v${v}_intergenic.bed.gz
And that’s it.
[图片上传失败...(image-8bf86a-1540296254859)]
- Here’s a schematic showing the steps we performed above.
Genome coverage
How much of the genome does exons, introns, and intergenic regions cover?
#!/bin/env perl
use strict;
use warnings;
my $v = 19;
my $exon_file = "gencode_v${v}_exon_merged.bed.gz";
my $intergenic_file = "gencode_v${v}_intergenic.bed.gz";
my $intron_file = "gencode_v${v}_intron.bed.gz";
my $exon_coverage = coverage($exon_file);
my $intergenic_coverage = coverage($intergenic_file);
my $intron_coverage = coverage($intron_file);
my $total = $exon_coverage + $intergenic_coverage + $intron_coverage;
printf "Exon: %.2f\n", $exon_coverage*100/$total;
printf "Intron: %.2f\n", $intron_coverage*100/$total;
printf "Intergenic: %.2f\n", $intergenic_coverage*100/$total;
sub coverage {
my ($infile) = @_;
my $coverage = 0;
open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
my ($chr, $start, $end) = split(/\t/);
my $c = $end - $start;
$coverage += $c;
}
close(IN);
return($coverage);
}
exit(0);
Running this script:
coverage.pl
Exon: 3.72
Intron: 48.25
Intergenic: 48.03
I had initially thought that the intergenic region would have the highest percent coverage in the genome compared to introns and exons.
The untranslated region (UTR)
I got a question (see comments) regarding the untranslated region (UTR). To illustrate my answer, I’ll use the transcript ENST00000335295 as an example:
v=19
zcat gencode.v$v.annotation.gtf.gz | grep ENST00000335295
chr11 HAVANA transcript 5246694 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA exon 5248160 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA CDS 5248160 5248251 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA start_codon 5248249 5248251 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1; exon_id "ENSE00001829867.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA exon 5247807 5248029 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2; exon_id "ENSE00001057381.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA CDS 5247807 5248029 . - 1 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2; exon_id "ENSE00001057381.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA exon 5246694 5246956 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA CDS 5246831 5246956 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA stop_codon 5246828 5246830 . - 0 gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3; exon_id "ENSE00001600613.2"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA UTR 5248252 5248301 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11 HAVANA UTR 5246694 5246830 . - . gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
This transcript (HBB) starts at 5248301 and ends at 5246694 on chromosome 11 (the coordinates are reversed because it’s on the negative strand). We see two lines with the UTR feature for this transcript, one starting at 5248301 and the other at 5246830.
So the coordinates for the 5′ UTR are chr11:5248252-5248301 and the 3′ UTR are chr11:5246694-5246830. Perhaps I should have picked a transcript that’s on the positive strand, however this was one example I knew where the transcript is short and well defined (contains the start and end codons, and the 5′ and 3′ UTRs). Which now leads me to wonder, how many transcripts have defined UTRs?
#!/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.annotation.gtf.gz>\n";
my $infile = shift or die $usage;
if ($infile =~ /\.gz/){
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
my %transcript = ();
my $current_transcript = '';
while(<IN>){
chomp;
next if (/^#/);
#chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
my @annotation = split(/;\s/,$annotation);
my $transcript_id = 'none';
if ($type eq 'transcript'){
foreach my $blah (@annotation){
my ($type,$name) = split(/\s+/,$blah);
if ($type eq 'transcript_id'){
$current_transcript = $name;
$current_transcript =~ s/"//g;
$transcript{$current_transcript} = 0;
}
}
if ($current_transcript eq 'none'){
die "No name for entry $.\n";
}
}
if ($type eq 'UTR'){
$transcript{$current_transcript}++;
}
}
close(IN);
foreach my $transcript (keys %transcript){
print "$transcript\t$transcript{$transcript}\n";
}
exit(0);
Now to run the script:
check_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr_number.out.gz
zcat transcript_utr_number.out | wc -l
196520
zcat transcript_utr_number.out | cut -f2 | sort | uniq -c | sort -k1rn | head
103800 0
34999 2
24234 3
11543 1
10352 4
4493 5
2264 6
1297 7
796 8
583 9
zcat transcript_utr_number.out | awk '$2==57 {print}'
ENST00000264319.7 57
Running the script above, I find that over half of the GENCODE transcripts (103,800/196,520) don’t have a defined UTRs. 34,999 transcripts have 2 defined UTRs, 24,234 have 3 because one of the UTR is spliced. The transcript ENST00000264319.7 had 57 UTR lines, which was the most UTR annotations for a transcript.
Defining the 5′ and 3′ UTRs
I’m not entirely sure why the UTRs weren’t separated into 5′ and 3′, especially when GENCODE went through the trouble of labelling the start_codon and the stop_codon. I’m going to assume that if a UTR is closer to the starting position of the transcript it’s the 5′ UTR and vice versa.
#!/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;
if ($infile =~ /\.gz/){
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
my %transcript = ();
my $current_transcript = '';
my $transcript_start = 0;
my $transcript_end = 0;
while(<IN>){
chomp;
next if (/^#/);
#chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
my @annotation = split(/;\s/,$annotation);
my $transcript_id = 'none';
if ($type eq 'transcript'){
foreach my $blah (@annotation){
my ($type,$name) = split(/\s+/,$blah);
if ($type eq 'transcript_id'){
$current_transcript = $name;
$current_transcript =~ s/"//g;
$transcript_start = $start;
$transcript_end = $end;
}
}
if ($current_transcript eq 'none'){
die "No name for entry $.\n";
}
}
if ($type eq 'UTR'){
my $region = '';
if ($strand eq '+'){
my $dis_to_start = abs($start - $transcript_start);
my $dis_to_end = abs($start - $transcript_end);
$region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
} else {
my $dis_to_start = abs($end - $transcript_end);
my $dis_to_end = abs($end - $transcript_start);
$region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
}
print join ("\t", $chr, $start, $end, $region, $current_transcript, $strand),"\n";
}
}
close(IN);
exit(0);
To run the script:
print_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr.bed.gz
zcat transcript_utr.bed.gz | head
chr1 70006 70008 3_UTR ENST00000335137.3 +
chr1 137621 138532 5_UTR ENST00000423372.3 -
chr1 139310 139379 5_UTR ENST00000423372.3 -
chr1 134901 135802 3_UTR ENST00000423372.3 -
chr1 367640 367658 5_UTR ENST00000426406.1 +
chr1 368595 368634 3_UTR ENST00000426406.1 +
chr1 621059 621098 3_UTR ENST00000332831.2 -
chr1 622035 622053 5_UTR ENST00000332831.2 -
chr1 819981 819983 3_UTR ENST00000594233.1 +
chr1 860260 860328 5_UTR ENST00000420190.1 +
Basically the script gets the coordinates of the transcript, and whenever there is an UTR annotation, it checks to see if the UTR coordinates are closer to the start or the end of the transcript.
Promoter region
Here’s a script that reads the GENCODE annotation file, obtains the starting positions of each transcript, adds some user defined padding around this starting position and outputs it as a bed file. The starting position plus some padding is defined as the promoter region. Note that this script assumes that the hg19.genome file is available in the same directory as the script.
#!/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.annotation.gtf> <padding>\n";
my $infile = shift or die $usage;
my $span = shift or die $usage;
if ($span !~ /^\d+$/){
die "Please enter a numeric value for the padding\n";
}
my $hg19 = 'hg19.genome';
my %hg19 = ();
open(IN,'<',$hg19) || die "Could not open $hg19: $!\n";
while(<IN>){
chomp;
#chr9_gl000201_random 36148
my ($chr, $end) = split(/\t/);
$hg19{$chr} = $end;
}
close(IN);
if ($infile =~ /\.gz/){
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
while(<IN>){
chomp;
next if (/^#/);
#chr11 HAVANA transcript 65265233 65273940 . + . gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
next unless $type eq 'transcript';
my @annotation = split(/;\s/,$annotation);
my $transcript_id = 'none';
foreach my $blah (@annotation){
my ($type,$name) = split(/\s+/,$blah);
if ($type eq 'transcript_id'){
$transcript_id = $name;
$transcript_id =~ s/"//g;
}
}
if ($transcript_id eq 'none'){
die "No name for entry $.\n";
}
my $promoter_start = '';
my $promoter_end = '';
if ($strand eq '+'){
$promoter_start = $start - $span;
$promoter_end = $start + $span;
} else {
$promoter_start = $end - $span;
$promoter_end = $end + $span;
}
if ($promoter_start < 0){
warn "Adjusted promoter start to 0\n";
$promoter_start = 0;
} elsif ($promoter_end > $hg19{$chr}){
warn "Adjusted promoter end to $hg19{$chr}\n";
$promoter_end = $hg19{$chr};
}
print join("\t",$chr,$promoter_start,$promoter_end,$transcript_id,0,$strand),"\n";
}
close(IN);
exit(0);
Running the script:
promoter.pl gencode.v19.annotation.gtf.gz 200 | head
chr1 11669 12069 ENST00000456328.2 0 +
chr1 11672 12072 ENST00000515242.2 0 +
chr1 11674 12074 ENST00000518655.2 0 +
chr1 11810 12210 ENST00000450305.2 0 +
chr1 29170 29570 ENST00000438504.2 0 -
chr1 24686 25086 ENST00000541675.1 0 -
chr1 29170 29570 ENST00000423562.1 0 -
chr1 29370 29770 ENST00000488147.1 0 -
chr1 29606 30006 ENST00000538476.1 0 -
chr1 29354 29754 ENST00000473358.1 0 +
See also
See the BEDTools documentation for cool usage examples.
All code used in this post is available at https://github.com/davetang/defining_genomic_regions.
Print Friendly, PDF & Email
Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License.
SHARE THIS:
TwitterGoogleFacebookLinkedInEmail
LIKE THIS:
RELATED
GENCODE
September 12, 2012
In "bioinformatics"
Getting acquainted with analysing DNA sequencing data
December 16, 2015
In "genomics"
ENCODE RNA polymerase II ChIP-seq
March 23, 2014
In "bioinformatics"
Posted in genomics
Tagged bedtools, genome
22 COMMENTS ADD YOURS
ASAKI says:
July 22, 2013 at 10:33 pm
Hi Dave,
Great resources and tutorials to extract the genomic regions from the GENCODE file.
Here, I want to look into the 3’UTR and 5’UTR regions in the gencode gtf file
altough the gtf file has a thrid column containg UTR info, it does not seperate the 3’UTR and 5’UTR.
Moreover, in UCSC browser, it offers us the 3’UTR&5’UTR bed file for V14 gencode.
So my question is how can we find this info from the gencode file from http://www.gencodegenes.org/releases/17.html, i.e., V17, the latest version gtf file?
Thanks
REPLY
DAVO says:
July 22, 2013 at 11:37 pm
Hi Asaki,
Thank you for the compliment. As for the UTRs, since we know the orientation of the gene/transcript, from the gtf file, we know whether the UTR is at the 5′ or 3′ end.
Bioconductor contains GENCODE genes, but it is not the latest version (since it’s derived from the UCSC Genome Browser).
It wouldn’t be too difficult to write a Perl script that extracts UTRs based on the gene orientation but some people are against writing custom scripts. I’m fine with custom scripts as long as they work as expected and are shared.
Cheers,
Dave
REPLY
Pingback: Sequence conservation in vertebrates - Musings from a PhD candidate
SAAD says:
February 2, 2014 at 10:26 pm
How about doing the same thing on a plant genome or for other non model organism for which something like gencode is not available.
REPLY
DAVO says:
February 2, 2014 at 11:12 pm
Yeah just use another set of transcript annotations and adapt the same concept.
REPLY
SAAD KHAN says:
February 14, 2014 at 12:04 am
What if the organism does not have annotations of UTR and also how to get it in a format like gencode has. For example I have this file :-
ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Gmax/annotation/Gmax_189_gene.gff3.gz
what additional steps do I need to follow to get it into a format similar to gencode.
Also gencode has lot of annotations that my file does not so your script does not seems to work with it.
REPLY
DAVO says:
February 17, 2014 at 1:49 am
Hi Saad,
I downloaded the file and had a look; it has annotations for:
CDS
five_prime_UTR
gene
mRNA
three_prime_UTR
It seems the UTRs are defined already?
Cheers,
Dave
REPLY
SAAD KHAN says:
February 17, 2014 at 2:30 am
Hi Dave,
I was more interested to generate a file similar to what UCSC generates. Unfortunately my organism is not there in UCSC browser.
This is the kind of output I am looking for. Do you think if there are already tools to do that or do I have to write a script for it myself.
output by ucsc : –
chr1 14969 15038 NR_024540_utr3_1_0_chr1_14970_r 0 – NR_024540 WASH7P 653635 pseudogene
REPLY
ALOK says:
July 22, 2014 at 7:23 am
Thank you very much for the above executed scripts, i was able to cross check my ouput with that of bedtools. I have no transcript info. All i have is exon and cds.The difference lowest start point of cds and lowest start point of exon will be my 5’UTR and difference in highest will be my 3’UTR.
The following is my GFF file can you please help me extract the UTR regions?
Ca1 Gnomon gene 254 741 . – . ID=gene0;Name=LOC101497325;Dbxref=GeneID:101497325;gbkey=Gene;gene=LOC101497325
Ca1 Gnomon mRNA 254 741 . – . ID=rna0;Name=XM_004486173.1;Parent=gene0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 602 741 . – . ID=id1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 254 506 . – . ID=id2;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon CDS 602 663 . – 0 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon CDS 254 506 . – 1 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon gene 18089 20552 . – . ID=gene1;Name=LOC101488339;Dbxref=GeneID:101488339;gbkey=Gene;gene=LOC101488339
Ca1 Gnomon gene 165984 168949 . + . ID=gene10;Name=LOC101493009;Dbxref=GeneID:101493009;gbkey=Gene;gene=LOC101493009
Ca1 Gnomon mRNA 165984 168949 . + . ID=rna18;Name=XM_004485362.1;Parent=gene10;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 165984 166302 . + . ID=id169;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 167289 168949 . + . ID=id170;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon CDS 167476 168681 . + 0 ID=cds17;Name=XP_004485419.1;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XP_004485419.1;gbkey=CDS;product=uncharacterized protein LOC101493009;protein_id=XP_004485419.1
REPLY
DAVO says:
July 22, 2014 at 8:01 am
Hi Alok,
if I understood you correctly, then the 5′ UTR for LOC101497325 is Ca1:663-741 and the 3′ UTR is Ca1:254-506 and for LOC101493009 the 5′ UTR is Ca1:165984-166302 and the 3′ UTR is Ca1:168681-168949.
Cheers,
Dave
REPLY
JULIEN says:
September 16, 2014 at 8:37 am
Thanks Dave for this very nice post.
I found a small in the conversion of GTF format to BED format. GTF has 1-based start coordinates, and BED has 0-based start coordinates. Thus the awk command to convert should for example be:
zcat gencode.v$v.annotation.gtf.gz |
awk ‘BEGIN{OFS=”\t”;} $3==”exon” {print $1,$4-1,$5}’
etc.
Thanks
Julien
REPLY
DAVO says:
September 16, 2014 at 8:40 am
Thanks Julien! This off-by-one error caused by 0- and 1-based coordinates is probably the most common bioinformatic error 🙂
REPLY
BRUCE says:
November 26, 2014 at 4:48 pm
Hi Dave,
nice piece of work, thanks. Wondering about your thoughts on how to avoid issues when merging the exons due to overlapping genes on each strand. How do you then know which gene/transcript is aligned to?
Thanks,
Bruce.
REPLY
DAVO says:
November 27, 2014 at 5:33 am
Hi Bruce,
thanks for the comment. This post was definitely simplifying the task of annotating mapped reads, which as you point out can get quite complicated (and messy).
One approach a colleague took, was to create a table for mapped reads (or clusters of reads), where each column was a genomic property, e.g. antisense to promoter, sense to exon, etc., and each column would be filled with the transcript ID that fulfilled that property. At the end, he could be able to tell the properties of a read. Needless to say he didn’t merge exons, like I did, or else he would lose information.
Cheers,
Dave
REPLY
JAY says:
March 27, 2015 at 11:52 am
Can you share your intergenicRegion.bed file?
REPLY
BENT says:
October 26, 2015 at 8:12 am
Thanks a lot, Dave.
It’s a wonderful post. While when I download gencode.v20.annotation.gtf.gz and so forth, it gives the error messages as below,
====
Checking and Creating Intergenic Regions
Warning: end of BED entry exceeds chromosome length. Please correct.
====
I also check the entry count for each output.
====
152320 gencode.v19.annotation.gtf.gz
8543 gencode_v19_exon_merged.bed.gz
575 gencode_v19_intergenic.bed.gz
6805 gencode_v19_intron.bed.gz
8060 promoter.bed.gz
8243 transcript_utr.bed.gz
3292 transcript_utr_number.out.gz
====
158000 gencode.v23.annotation.gtf.gz
8116 gencode_v23_exon_merged.bed.gz
187 gencode_v23_intergenic.bed.gz
6998 gencode_v23_intron.bed.gz
8731 promoter.bed.gz
8234 transcript_utr.bed.gz
3068 transcript_utr_number.out.gz
====
The gencode_v23_intergenic.bed.gz is much smaller than gencode_v19_intergenic.bed.gz, is it correct output?
Thanks a lot for your comment.
REPLY
DAVO says:
October 26, 2015 at 10:15 am
Hi Ben,
GENCODE version 19 is the last version of GENCODE that uses the hg19 reference genome; that’s why you’re getting the warnings (a good thing!).
In the MySQL command, change it to: “select chrom, size from hg38.chromInfo” and save the output as hg38.genome and change the workflow accordingly.
Cheers,
Dave
REPLY
MAYANK KAASHYAP says:
July 27, 2016 at 6:04 am
Hi Dave,
Very nice post. All goes well except when I try to isolate intergenic regions. I get an error: Less than the req’d two fields were encountered in the genome file. May I please request you to help me!
REPLY
DAVO says:
July 28, 2016 at 6:38 am
Can you confirm that your genome file looks something like this?
cat hg19.genome | head
chrom size
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chrX 155270560
chr8 146364022
REPLY
MAYANK KAASHYAP says:
July 28, 2016 at 10:59 am
Hi Dave,
Thanks so much. It works very well now!
I was put Ca1 instead refseq Sequence NC_021160.1.
Chromosome Size
NC_021160.1 48359943
NC_021161.1 36634854
NC_021162.1 39989001
NC_021163.1 49191682
NC_021164.1 48169137
NC_021165.1 59463898
NC_021166.1 48961560
NC_021167.1 16477302
REPLY
MAYANK KAASHYAP says:
May 16, 2017 at 10:34 am
Hi Dave,
Thansk for this. How can I cite you.
REPLY
DAVO says:
May 17, 2017 at 1:44 am
Hi Mayank,
you can use the URL of the post if you really want to cite me. Otherwise, just cite:
https://www.ncbi.nlm.nih.gov/pubmed/20110278
Cheers,
Dave
REPLY
Leave a Reply
Your email address will not be published. Required fields are marked *
Comment
Name *
Email *
Website
Save my name, email, and website in this browser for the next time I comment.
Notify me of follow-up comments by email.
Notify me of new posts by email.
This site uses Akismet to reduce spam. Learn how your comment data is processed.
Search for:
Search …
WHO'S ONLINE
33 visitors online now
23 guests, 10 bots, 0 members
Map of Visitors
Creative Commons License