[Bioc-sig-seq] Why are there non-imformative names() from GenomicFeatures:::exonsBy(...)?
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Aug 4 23:45:48 CEST 2010
Actually, adding making by accept "tx_name" would be cleaner.
exonsBy(txdb, by = c("tx", "tx_name", "gene"))
cdsBy(txdb, by = c("tx", "tx_name", "gene"))
Patrick
On 8/4/10 2:37 PM, Patrick Aboyoun wrote:
> That being said, we could still add a flag (useTxNames=FALSE?) to the
> exonsBy function that, if TRUE, would check to see if the transcript
> names are suitable, and if they are suitable, it would return the
> output you are looking for, and if they are not suitable, the function
> would throw and error stating the transcript names cannot be used for
> this purpose. For Steve use case, using transcript names would work
> just fine:
>
> > txdb <- makeTranscriptDbFromUCSC(genome='hg18', tablename='ensGene')
>
> > tx <- transcripts(txdb)
>
> > length(tx)
> [1] 63280
>
> > table(nchar(values(tx)[["tx_name"]]))
> 15
> 63280
>
> > table(duplicated(values(tx)[["tx_name"]]))
> FALSE
> 63280
>
> Cheers,
> Patrick
>
>
>
> On 8/4/10 10:39 AM, Marc Carlson wrote:
>> Hi Steve,
>>
>> Patrick is correct about the state of transcript IDs. We require them
>> to be unique, and since we cannot always rely on them being unique and
>> assigned from some sources, we have to assign them ourselves internally
>> to be certain. You can still recover the original names using the
>> transcripts() method (which would be the xscripts object in your
>> example), and then you would know which of those map based on the
>> tx_ids.
>>
>> When we initially mocked this up we were using the transcript IDs
>> presented by the sources, but we ran into too many circumstances where
>> we could not rely on them to be unique. The present solution is
>> slightly less fun to use, but safer for the end user. Internally, we
>> are using the tx_id as a primary key and so the user can have a better
>> guarantee that it will behave itself than if we tried to use the other
>> IDs. A similar thing had to be done for exon ids and cds ids since not
>> everyone even names those with an ID. Sometimes all your get is a
>> unique combination of ranges associated with a gene ID.
>>
>>
>> Marc
>>
>>
>>
>> On 08/04/2010 08:38 AM, Patrick Aboyoun wrote:
>>> Steve,
>>> Herve has better knowledge of the transcript names key than I do since
>>> he was dealing with the pre-processed data from UCSC and BioMart, but
>>> it is my understanding that there is no guarantee that the transcript
>>> names assigned by these resources are unique or even exist so using
>>> the transcript names can be problematic. We could add a flag
>>> (useTxNames=FALSE?) to the exonsBy function that, if TRUE, would check
>>> to see if the transcript names are suitable, and if they are suitable,
>>> it would return the output you are looking for, and if they are not
>>> suitable, the function would throw and error stating the transcript
>>> names cannot be used for this purpose. Thoughts?
>>>
>>>
>>> Cheers,
>>> Patrick
>>>
>>>
>>>
>>> On 8/4/10 6:20 AM, Steve Lianoglou wrote:
>>>> Howdy,
>>>>
>>>> For what it's worth, I'm trying to hack on GenomicFeatures in a
>>>> non-intrusive so that I can (i) wrap some functionality in a way that
>>>> I think is useful/intuitive, and (ii) hopefully contribute back to the
>>>> package itself.
>>>>
>>>> I addressed the "problem" in my original email by tweaking the
>>>> .featuresBy function here:
>>>>
>>>> http://github.com/lianos/GenomicFeaturesX/blob/master/R/transcriptsBy.R#L56
>>>>
>>>>
>>>>
>>>> (around lines 56, and again 93)
>>>>
>>>> As I'm a noob with this package, I'm not sure if it will aversely
>>>> effect anything else down stream, but it's doing what I need for now,
>>>> so that exonsBy(txdb, 'tx') now has the transcript id's in the names()
>>>> slot of the GRangesList that is returned.
>>>>
>>>> If it pleases the court, I'd also like to S4-ize the `exons` and
>>>> `transcripts` functions (which currently work on TranscriptDb's) since
>>>> I'd like to use them as accessors for my Gene-like objects:
>>>>
>>>> http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R
>>>>
>>>> Cheers,
>>>> -steve
>>>>
>>>> On Tue, Aug 3, 2010 at 7:33 PM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm seeing if I can transition some of my code from my
>>>>> (previously-mentioned) GenomeAnnotations package to use
>>>>> GenomicFeatures, so I have a few questions of how certain things
>>>>> should be done w/ GenomicFeatures.
>>>>>
>>>>> I'll start with this one:
>>>>>
>>>>> Shouldn't the GRangesList returned by exonsBy(txdb, 'tx') have more
>>>>> informative names than:
>>>>> "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
>>>>> ?
>>>>>
>>>>> I've made a TranscriptDb from the 'ensembl' gene annos, and I'd like
>>>>> to reconcile which "transcript exons" belong to which transcripts,
>>>>> but
>>>>> it seems there's no direct link from the GRangesList returned by
>>>>> exonsBy(..., 'tx') and the GRanges object returned by
>>>>> transcripts(txdb).
>>>>>
>>>>> Shouldn't there be? One would expect a 1:1 map, no? The length of the
>>>>> exonsBy/GRangesList is the same as transcripts/GRanges object, so ...
>>>>> yes :-).
>>>>>
>>>>> I mean:
>>>>>
>>>>> R> txdb<- makeTranscriptDbFromUCSC(genome='hg18',
>>>>> tablename='ensGene')
>>>>> R> txdb
>>>>> TranscriptDb object:
>>>>> | Db type: TranscriptDb
>>>>> | Data source: UCSC
>>>>> | Genome: hg18
>>>>> | UCSC Table: ensGene
>>>>> | Type of Gene ID: Ensembl gene ID
>>>>> | Full dataset: yes
>>>>> | transcript_nrow: 63280
>>>>> | exon_nrow: 276075
>>>>> | cds_nrow: 225373
>>>>> | Db created by: GenomicFeatures package from Bioconductor
>>>>> | Creation time: 2010-08-03 16:50:06 -0400 (Tue, 03 Aug 2010)
>>>>> | GenomicFeatures version at creation time: 1.0.6
>>>>> | RSQLite version at creation time: 0.9-2
>>>>>
>>>>> R> xcripts<- transcripts(txdb)
>>>>> R> xcripts
>>>>> xcripts
>>>>> GRanges with 63280 ranges and 2 elementMetadata values
>>>>> seqnames ranges strand | tx_id
>>>>> tx_name
>>>>> <Rle> <IRanges> <Rle>
>>>>> |<integer> <character>
>>>>> [1] chr1 [ 1873, 3533] + | 1730
>>>>> ENST00000404059
>>>>> [2] chr1 [ 20229, 20366] + | 1732
>>>>> ENST00000408384
>>>>> ...
>>>>>
>>>>> R> xcripts.exons<- exonsBy(txdb, 'tx')
>>>>> R> head(names(xcripts.exons))
>>>>> [1] "1" "2" "3" "4" "5" "6"
>>>>>
>>>>> I feel like names(xcripts.exons) should give me something like:
>>>>> "ENST00000404059" "ENST00000408384" "ENST00000359752" .... (in
>>>>> correct
>>>>> order, of course)
>>>>>
>>>>> My same confusion has to do lack of informative names returned from
>>>>> exonsBy(txdb, 'gene') -- I feel like it should set the names in the
>>>>> same way that transcriptsBy(txdb, 'gene').
>>>>>
>>>>> So, I wonder if this is just an oversight, or is it not there on
>>>>> purpose and I have to rethink the way I approach these problems.
>>>>> Should I be findOverlap-ing between my xcripts.exons GRangesList and
>>>>> my xcripts GRanges, or something? And if so, why is that better?
>>>>>
>>>>> R> sessionInfo()
>>>>> R version 2.12.0 Under development (unstable) (2010-06-28 r52408)
>>>>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] RSQLite_0.9-2 DBI_0.2-5 GenomicFeatures_1.1.6
>>>>> GenomicRanges_1.1.20
>>>>> [5] IRanges_1.7.15
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.27
>>>>> BSgenome_1.17.6 RCurl_1.4-3
>>>>> [6] rtracklayer_1.9.4 tools_2.12.0 XML_3.1-0
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -steve
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>> | Memorial Sloan-Kettering Cancer Center
>>>>> | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list