[BioC] [Hinxton #251937] RE: GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart
Hervé Pagès
hpages at fhcrc.org
Tue Mar 13 23:09:10 CET 2012
Hi Steffen,
On 03/13/2012 02:37 PM, Steffen Durinck wrote:
> Hi Herve,
>
> To answer your question:
>
> "Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story..."
>
> By default biomaRt queries http://www.biomart.org , which hosts a copy
> of Ensembl. There is a time lag between an Ensembl update and an update
> of Ensembl on biomart.org
Thanks Steffen for the details. Yes I knew about this lag, we see it at
each new Ensembl release. I guess the grumbling was more like "why on
earth does it take 2 weeks every time for a new Ensembl release to
propagate to http://biomart.org?". Or, "why on earth do we have to wait
2 weeks after each new Ensembl release to see our unit tests break in
the GenomicFeatures package?" ;-)
>
> An alternative is to query ensembl directly by specifying the host:
>
> > library(biomaRt)
> > listMarts(host="uswest.ensembl.org")
>                biomart               version
> 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 66
> 2     ENSEMBL_MART_SNP  Ensembl Variation 66
> > mart = useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
>                  host="uswest.ensembl.org")
Thanks for the reminder. I wish they could use the same biomart name:
why "ensembl" on http://biomart.org and "ENSEMBL_MART_ENSEMBL" on
http://uswest.ensembl.org? Now I'll stop grumbling...
>
>
> Note that the normal ensembl host is www.ensembl.org, but for some
> reason if you use this on the US west coast, you end up in a redirect
> page to uswest.ensembl.org. This redirecting is something new and
> biomaRt won't work currently if you use www.ensembl.org as host when
> you're based in the US, so use uswest.ensembl.org
Thanks for the extra details.
Cheers,
H.
>
> Cheers,
> Steffen
>
>
>
>
> 2012/3/13 Hervé Pagès <hpages at fhcrc.org>
>
> Hi Malcolm, Rhoda,
>
> Did you hear back from the Ensembl helpdesk about this issue?
>
> AFAICT the issue is still in Ensembl release 66 (released 10 days
> ago). For example, when querying directly the Ensembl Mart, I get
> the following for transcript FBtr0079414 (dmelanogaster):
>
> Exon Rank in Transcript | Chromosome Name | Strand
> 1 | 2L | -1
> 2 | 2L | -1
>
> Exon Chr Start (bp) | Exon Chr End (bp)
> 7218909 | 7220029
> 7218643 | 7218853
>
> 5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
> 7219112 | 7220029 | |
> | | 7218643 | 7218853
>
> CDS Start | CDS End | CDS Length
> 1 | 203 | 204
> 204 | 204 | 204
>
> Note that querying directly the Ensembl Mart thru the web interface
> allows me to choose database Ensembl Genes 66 but querying with the
> Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story...
>
> So the "CDS Length" column (which, IIUC, is actually supposed to
> report the "Total CDS Length") is still incompatible with the
> exon/UTR starts and ends. If the exon/UTR starts and ends
> are correct then the total CDS length should be 203, not 204.
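For what it's worth, the arithmetic behind that 203 can be sketched in a few lines of R. This is only an illustration using the coordinates from the table above; cds_length_from_utrs() is a hypothetical helper written for this note, not a GenomicFeatures or biomaRt function, and it assumes each UTR lies entirely within the exon it is reported on:

```r
# Hypothetical helper (not part of GenomicFeatures or biomaRt).
# Assumption: each UTR lies entirely inside the exon it is reported on, so
# the CDS on an exon is the exon width minus the UTR widths on that exon.
cds_length_from_utrs <- function(exon_start, exon_end,
                                 utr5_start, utr5_end,
                                 utr3_start, utr3_end) {
  # Width of a closed interval; a missing (NA) interval counts as width 0.
  width <- function(start, end) {
    ifelse(is.na(start) | is.na(end), 0, end - start + 1)
  }
  sum(width(exon_start, exon_end) -
      width(utr5_start, utr5_end) -
      width(utr3_start, utr3_end))
}

# FBtr0079414 (minus strand), one element per exon, from the table above.
inferred <- cds_length_from_utrs(
  exon_start = c(7218909, 7218643), exon_end = c(7220029, 7218853),
  utr5_start = c(7219112, NA),      utr5_end = c(7220029, NA),
  utr3_start = c(NA, 7218643),      utr3_end = c(NA, 7218853)
)
inferred  # 203, versus the 204 reported as "CDS Length"
```

Exon 1 contributes 1121 - 918 = 203 bases and exon 2 contributes 211 - 211 = 0, which is where the mismatch with the reported 204 comes from.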
>
> But also, it could be that the exon/UTR starts and ends are
> incorrect.
>
> Finally note that there is no CDS region on exon 2 (the 3' UTR
> entirely spans exon 2) but the Ensembl Mart reports a CDS region
> of length 1 on this exon (CDS Start = CDS End = 204). This is
> probably why then the reported CDS Length is 204 (at least it's
> consistent with the highest "CDS End" value).
>
> Would be nice to see this dataset fixed.
>
> Thanks,
> H.
>
>
> On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
>
> Dear helpdesk at ensemblgenomes.org,
>
> I am following up on this issue which I understand Rhoda
> Kinsella at EBI to have forwarded to you.
>
> I originally identified and reported the issue, first to the
> bioconductor email list where Rhoda picked up on it and replied
> as below.
>
> I am trying to ensure that there is a tracked issue with
> ensemblgenomes.org with my name on it. Not that it has to be
> resolved with a fix; I'd just like to be assured that I will be
> informed as you resolve it.
>
> If there is anything further I can provide pertaining to
> describing or resolving the issue, please advise.
>
> Of course the issue may be in fact even further upstream – in
> flybase. I've not tried to find the root cause myself.
>
> Thanks,
>
> Malcolm Cook
>
>
> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> Date: Wed, 8 Feb 2012 10:27:02 -0600
> To: Malcolm Cook <mec at stowers.org>
> Cc: Hervé Pagès <hpages at fhcrc.org>, "bioconductor at r-project.org"
> <bioconductor at r-project.org>
> Subject: Re: [Hinxton #251937] RE: [BioC]
> GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data
> anomaly: for some transcripts, the cds cumulative length
> inferred from the exon and UTR info doesn't match the
> "cds_length" attribute from BioMart
>
> Hi Malcolm and Hervé
> This appears to be a data issue with the Drosophila core
> database which was then propagated into BioMart. I have
> forwarded the issue to the Ensembl Genomes project as they
> maintain this database and they will respond as soon as possible.
> Regards
> Rhoda
>
>
> On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
>
> Herve, Thanks so much for digging into this.
>
> Rhoda, I had submitted a ticket as suggested to Ensembl
> helpdesk, and have included them as recipients to this message
> (after changing the subject to include the issue tracker number).
>
> Ensembl helpdesk, I expect that Herve's detailed report, below,
> provides an example of the reported data anomaly that will help
> resolve the underlying issue.
>
> Cheers,
>
> ~Malcolm
>
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, February 07, 2012 2:37 PM
> To: Rhoda Kinsella; bioconductor at r-project.org
> Cc: Cook, Malcolm
> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
> BioMart data anomaly: for some transcripts, the cds cumulative
> length inferred from the exon and UTR info doesn't match the
> "cds_length" attribute from BioMart
>
> Hi Rhoda, Malcolm, and others,
>
> So after taking a closer look at this, I can confirm that the
> reported "cds_length" looks wrong for some Fly transcripts. Take
> for example the FBtr0079414 transcript (minus strand):
>
> library(biomaRt)
> mart1 <- useMart(biomart="ensembl",
>                  dataset="dmelanogaster_gene_ensembl")
> attributes <- c("ensembl_transcript_id", "strand",
>                 "rank", "exon_chrom_start", "exon_chrom_end",
>                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
>                 "cds_length")
> filters <- "ensembl_transcript_id"
> values <- "FBtr0079414"
> getBM(attributes=attributes, filters=filters, values=values,
>       mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end
> 1           FBtr0079414     -1    1          7218909        7220029
> 2           FBtr0079414     -1    2          7218643        7218853
>   5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
> 1     7219112   7220029          NA        NA        204
> 2          NA        NA     7218643   7218853        204
>
> 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
> CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
> bases upstream of the exon start. But the reported cds_length is 204.
> Something looks wrong.
>
> For other transcripts, e.g. FBtr0300689 (plus strand), things
> look OK:
>
> values <- "FBtr0300689"
> getBM(attributes=attributes, filters=filters, values=values,
>       mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end
> 1           FBtr0300689      1    1             7529           8116
> 2           FBtr0300689      1    2             8193           9484
>   5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
> 1        7529      7679          NA        NA        855
> 2          NA        NA        8611      9484        855
>
> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
> upstream of the exon end. The start of the 3' UTR (located on exon 2)
> is 418 bases downstream of the exon start. So the CDS total length is
> 437 + 418 = 855, as reported.
>
> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
> commit a patch to this function so that this anomaly in the Ensembl
> data causes a warning instead of an error. Also the warning will
> display the first 6 affected transcripts. The patch will make it into
> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
> available via biocLite() in the next 24-36 hours.
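To make the intended behaviour concrete, here is a minimal sketch of a check that warns instead of failing and names at most 6 affected transcripts. It is not the actual patch: warn_on_cds_mismatch() and its arguments are made up for illustration, and it assumes you already have, per transcript, the CDS length inferred from the exon/UTR info and the cds_length reported by BioMart:

```r
# Hypothetical sketch, not the code committed to GenomicFeatures.
warn_on_cds_mismatch <- function(tx_id, inferred, reported) {
  # which() drops NA comparisons instead of propagating them.
  bad <- unique(tx_id[which(inferred != reported)])
  if (length(bad) > 0L) {
    warning("BioMart data anomaly: inferred and reported cds lengths differ ",
            "for ", length(bad), " transcript(s), starting with: ",
            paste(head(bad, 6L), collapse = ", "),
            call. = FALSE)
  }
  # Return the affected transcript ids so the caller can inspect them.
  invisible(bad)
}

# Values taken from the two transcripts discussed above. tryCatch() is used
# here only to capture the warning message as a string for display.
msg <- tryCatch(
  warn_on_cds_mismatch(
    tx_id    = c("FBtr0079414", "FBtr0300689"),
    inferred = c(203, 855),
    reported = c(204, 855)
  ),
  warning = function(w) conditionMessage(w)
)
msg
```

Called without tryCatch(), the function returns normally after signalling the warning, so the rest of the processing can go on.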
>
> Cheers,
> H.
>
>
> On 02/06/2012 02:18 PM, Hervé Pagès wrote:
> Hi Rhoda and others,
>
> I still need to check that this error issued by internal helper
> .extractCdsRangesFromBiomartTable() about "the cds cumulative
> length inferred from the exon and UTR not matching the cds_length
> attribute from BioMart" is not a false positive.
>
> I'm planning to patch the code in charge of this sanity check
> so it issues a warning instead of an error and it displays
> something more useful than just "for some transcripts etc...".
> Would be nice to know at least for which transcript.
>
> I'll keep you informed, thanks!
> H.
>
>
> On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
> Hi Malcolm and Marc,
> Please submit an Ensembl helpdesk ticket about this issue along with a
> detailed example to helpdesk at ensembl.org and we will look into it.
> Kind regards
> Rhoda
>
>
> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
>
> Hi Marc, and other `library(GenomicFeatures)` users working in fly,
>
> I just changed Subject to keep alive one of the issues I still have,
> namely:
>
> I get the following error:
>
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>     dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
> Download and preprocess the 'transcripts' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Download and preprocess the 'splicings' data frame ... Error
> in .extractCdsRangesFromBiomartTable(bm_table) :
> BioMart data anomaly: for some transcripts, the cds cumulative
> length inferred from the exon and UTR info doesn't match the
> "cds_length" attribute from BioMart
>
>
> Marc, you already observed that:
>
> the data for cds ranges and total cds length (both from biomaRt) no
> longer agree with each other. In other words, the data from the
> current drosophila ranges in biomaRt seems to disagree with itself,
> and so the code is refusing to make a package out of this data as a
> result. To get the 2nd issue fixed probably involves talking to
> ensembl about their CDS data for fly to see if we can resolve the
> discrepancy. I would be happy to take this to them.
>
> I still wonder:
>
> Can you recommend a best way to get a more diagnostic trace from the
> attempt at txdb creation so we can correctly report to ensembl team
> the errant transcript(s)?
>
> I would be happy to take this up with Ensembl team, but, need
> details which I don't know how to produce.
>
>
> Finally, on the side, here is a tiny suggestion:
>
> * change the default for circ_seqs in makeTranscriptDbFromBiomart
> to be NULL, instead of any organism (human) specific.
>
> Regards,
>
> --Malcolm
>
>
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
> [4] GenomicRanges_1.6.6 IRanges_1.12.5
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
> [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
> [9] tools_2.14.0 zlibbioc_1.0.0
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Rhoda Kinsella Ph.D.
> Ensembl Production Project Leader,
> European Bioinformatics Institute (EMBL-EBI),
> Wellcome Trust Genome Campus,
> Hinxton
> Cambridge CB10 1SD,
> UK.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319