[BioC] DEXSeq: problem with dexseq_prepare_annotation.py

Mon Jun 25 16:54:46 CEST 2012

Stephen Turner <vustephen at ...> writes:

> 
> Alejandro, Simon, Wolfgang, et al.:
> 
> I'm trying to use the dexseq_prepare_annotation.py script to parse the
> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes
> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an
> error:
> 
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in <module>
>     raise ValueError, "Same name found on two chromosomes: %s, %s" % (
> str(l[i]), str(l[i+1]) )
> ValueError: Same name found on two chromosomes: <GenomicFeature:
> exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>,
> <GenomicFeature: exonic_part 'CFB' at chr6_cox_hap1: 3359983 ->
> 3360325 (strand '+')>
> 
> I'm guessing this is because the same gene name is found in two
> separate places. I'm not entirely sure what these two chromosomal
> segments refer to, but I removed them from the GTF file and the python
> script threw another error:
> 
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in <module>
>     assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
> AssertionError: <GenomicFeature: exonic_part 'HIST2H3C+HIST2H3A' at
> chr1: 148079388 -> 148078883 (strand '-')> starts too early
> 
> I'm really unsure what to make of this or how to fix it. The script
> works without issues with the Ensembl GTF. Any help would be greatly
> appreciated.
> 
> Stephen
> 
> -----------------------------------------
> Stephen D. Turner, Ph.D.
> Bioinformatics Core Director
> University of Virginia School of Medicine
> bioinformatics.virginia.edu
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 

Dear Stephen,

I had the same problem when I tried dexseq_prepare_annotation.py with the mm9 or
mm10 GTF files from the Illumina igenomes collection. And like you have
mentioned it worked well with an Ensembl version. Going through the script and
the data files, I have realized all the problems go back to one root: Ensembl
has unique gene_id for each locus whereas other files have gene_id generated
from gene_name attribute. This replicates the gene_id for some loci as there are
multiple (cis/trans) coding regions. For a quick fix I have done the following
modification to the script (around line# 28):

f.attr['gene_id'] = f.iv.chrom + '_' + f.attr['gene_id'].replace( ":", "_" ) +
f.iv.strand

This generates a 'unique' gene_id for the script by combining the chromosome
number, gene name and strand information. As I said, it is a quick fix but it
seems to work so far without problems. I hope it might be of use for you.

Best,

Bulak Arpat, PhD
Bioinformatician
Center for Integrative Genomics
University of Lausanne