[Bioc-sig-seq] Why are there non-imformative names() from GenomicFeatures:::exonsBy(...)?

Wed Aug 4 23:45:48 CEST 2010

Actually, adding making by accept "tx_name" would be cleaner.

        exonsBy(txdb, by = c("tx", "tx_name", "gene"))
        cdsBy(txdb, by = c("tx", "tx_name", "gene"))

Patrick

On 8/4/10 2:37 PM, Patrick Aboyoun wrote:
> That being said, we could still add a flag (useTxNames=FALSE?) to the 
> exonsBy function that, if TRUE, would check to see if the transcript 
> names are suitable, and if they are suitable, it would return the 
> output you are looking for, and if they are not suitable, the function 
> would throw and error stating the transcript names cannot be used for 
> this purpose. For Steve use case, using transcript names would work 
> just fine:
>
> > txdb <- makeTranscriptDbFromUCSC(genome='hg18', tablename='ensGene')
>
> > tx <- transcripts(txdb)
>
> > length(tx)
> [1] 63280
>
> > table(nchar(values(tx)[["tx_name"]]))
>    15
> 63280
>
> > table(duplicated(values(tx)[["tx_name"]]))
> FALSE
> 63280
>
> Cheers,
> Patrick
>
>
>
> On 8/4/10 10:39 AM, Marc Carlson wrote:
>> Hi Steve,
>>
>> Patrick is correct about the state of transcript IDs.  We require them
>> to be unique, and since we cannot always rely on them being unique and
>> assigned from some sources, we have to assign them ourselves internally
>> to be certain.  You can still recover the original names using the
>> transcripts() method (which would be the xscripts object in your
>> example), and then you would know which of those map based on the 
>> tx_ids.
>>
>> When we initially mocked this up we were using the transcript IDs
>> presented by the sources, but we ran into too many circumstances where
>> we could not rely on them to be unique.  The present solution is
>> slightly less fun to use, but safer for the end user.  Internally, we
>> are using the tx_id as a primary key and so the user can have a better
>> guarantee that it will behave itself than if we tried to use the other
>> IDs.  A similar thing had to be done for exon ids and cds ids since not
>> everyone even names those with an ID.  Sometimes all your get is a
>> unique combination of ranges associated with a gene ID.
>>
>>
>>    Marc
>>
>>
>>
>> On 08/04/2010 08:38 AM, Patrick Aboyoun wrote:
>>> Steve,
>>> Herve has better knowledge of the transcript names key than I do since
>>> he was dealing with the pre-processed data from UCSC and BioMart, but
>>> it is my understanding that there is no guarantee that the transcript
>>> names assigned by these resources are unique or even exist so using
>>> the transcript names can be problematic. We could add a flag
>>> (useTxNames=FALSE?) to the exonsBy function that, if TRUE, would check
>>> to see if the transcript names are suitable, and if they are suitable,
>>> it would return the output you are looking for, and if they are not
>>> suitable, the function would throw and error stating the transcript
>>> names cannot be used for this purpose. Thoughts?
>>>
>>>
>>> Cheers,
>>> Patrick
>>>
>>>
>>>
>>> On 8/4/10 6:20 AM, Steve Lianoglou wrote:
>>>> Howdy,
>>>>
>>>> For what it's worth, I'm trying to hack on GenomicFeatures in a
>>>> non-intrusive so that I can (i) wrap some functionality in a way that
>>>> I think is useful/intuitive, and (ii) hopefully contribute back to the
>>>> package itself.
>>>>
>>>> I addressed the "problem" in my original email by tweaking the
>>>> .featuresBy function here:
>>>>
>>>> http://github.com/lianos/GenomicFeaturesX/blob/master/R/transcriptsBy.R#L56 
>>>>
>>>>
>>>>
>>>> (around lines 56, and again 93)
>>>>
>>>> As I'm a noob with this package, I'm not sure if it will aversely
>>>> effect anything else down stream, but it's doing what I need for now,
>>>> so that exonsBy(txdb, 'tx') now has the transcript id's in the names()
>>>> slot of the GRangesList that is returned.
>>>>
>>>> If it pleases the court, I'd also like to S4-ize the `exons` and
>>>> `transcripts` functions (which currently work on TranscriptDb's) since
>>>> I'd like to use them as accessors for my Gene-like objects:
>>>>
>>>> http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R
>>>>
>>>> Cheers,
>>>> -steve
>>>>
>>>> On Tue, Aug 3, 2010 at 7:33 PM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com>   wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm seeing if I can transition some of my code from my
>>>>> (previously-mentioned) GenomeAnnotations package to use
>>>>> GenomicFeatures, so I have a few questions of how certain things
>>>>> should be done w/ GenomicFeatures.
>>>>>
>>>>> I'll start with this one:
>>>>>
>>>>> Shouldn't the GRangesList returned by exonsBy(txdb, 'tx') have more
>>>>> informative names than:
>>>>> "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
>>>>> ?
>>>>>
>>>>> I've made a TranscriptDb from the 'ensembl' gene annos, and I'd like
>>>>> to reconcile which "transcript exons" belong to which transcripts, 
>>>>> but
>>>>> it seems there's no direct link from the GRangesList returned by
>>>>> exonsBy(..., 'tx') and the GRanges object returned by
>>>>> transcripts(txdb).
>>>>>
>>>>> Shouldn't there be? One would expect a 1:1 map, no? The length of the
>>>>> exonsBy/GRangesList is the same as transcripts/GRanges object, so ...
>>>>> yes :-).
>>>>>
>>>>> I mean:
>>>>>
>>>>> R>   txdb<- makeTranscriptDbFromUCSC(genome='hg18', 
>>>>> tablename='ensGene')
>>>>> R>   txdb
>>>>> TranscriptDb object:
>>>>> | Db type: TranscriptDb
>>>>> | Data source: UCSC
>>>>> | Genome: hg18
>>>>> | UCSC Table: ensGene
>>>>> | Type of Gene ID: Ensembl gene ID
>>>>> | Full dataset: yes
>>>>> | transcript_nrow: 63280
>>>>> | exon_nrow: 276075
>>>>> | cds_nrow: 225373
>>>>> | Db created by: GenomicFeatures package from Bioconductor
>>>>> | Creation time: 2010-08-03 16:50:06 -0400 (Tue, 03 Aug 2010)
>>>>> | GenomicFeatures version at creation time: 1.0.6
>>>>> | RSQLite version at creation time: 0.9-2
>>>>>
>>>>> R>   xcripts<- transcripts(txdb)
>>>>> R>   xcripts
>>>>> xcripts
>>>>> GRanges with 63280 ranges and 2 elementMetadata values
>>>>>          seqnames               ranges strand   |     tx_id
>>>>> tx_name
>>>>> <Rle> <IRanges> <Rle>
>>>>> |<integer> <character>
>>>>>      [1]     chr1     [  1873,   3533]      +   |      1730
>>>>> ENST00000404059
>>>>>      [2]     chr1     [ 20229,  20366]      +   |      1732
>>>>> ENST00000408384
>>>>> ...
>>>>>
>>>>> R>   xcripts.exons<- exonsBy(txdb, 'tx')
>>>>> R>   head(names(xcripts.exons))
>>>>> [1] "1" "2" "3" "4" "5" "6"
>>>>>
>>>>> I feel like names(xcripts.exons) should give me something like:
>>>>> "ENST00000404059" "ENST00000408384" "ENST00000359752" .... (in 
>>>>> correct
>>>>> order, of course)
>>>>>
>>>>> My same confusion has to do lack of informative names returned from
>>>>> exonsBy(txdb, 'gene') -- I feel like it should set the names in the
>>>>> same way that transcriptsBy(txdb, 'gene').
>>>>>
>>>>> So, I wonder if this is just an oversight, or is it not there on
>>>>> purpose and I have to rethink the way I approach these problems.
>>>>> Should I be findOverlap-ing between my xcripts.exons GRangesList and
>>>>> my xcripts GRanges, or something? And if so, why is that better?
>>>>>
>>>>> R>   sessionInfo()
>>>>> R version 2.12.0 Under development (unstable) (2010-06-28 r52408)
>>>>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] RSQLite_0.9-2         DBI_0.2-5             GenomicFeatures_1.1.6
>>>>> GenomicRanges_1.1.20
>>>>> [5] IRanges_1.7.15
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.9.0      biomaRt_2.5.1      Biostrings_2.17.27
>>>>> BSgenome_1.17.6    RCurl_1.4-3
>>>>> [6] rtracklayer_1.9.4  tools_2.12.0       XML_3.1-0
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -steve
>>>>> -- 
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>    | Memorial Sloan-Kettering Cancer Center
>>>>>    | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing