[Bioc-sig-seq] GenomicFeatures package
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu Oct 29 16:28:01 CET 2009
Hi,
On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
<snip>
>>>
>> It could be useful, but you'll need to get it into the shape
>> expected by the
>> GenomicFeatures functions. With specific regard to transcripts(),
>> it will
>> need a 'chrom' column with the chromosome names, 'name' column for
>> the gene
>> name, and 'txStart' and 'txEnd' for the start and end of the
>> transcripts,
>> using UCSC coordinate conventions.
>>
>> I'm now thinking that it would be more convenient for the user if
>> there was
>> a transcripts method on a RangesList object, which could provide the
>> necessary information. This could be extracted from the RangedData,
>> which
>> rtracklayer can create from your GFF file, with one caveat: if the
>> GFF file
>> contains a mixture of features (genes, exons, etc) and relies on the
>> hierarchical features of GFF, it will take more work to get things
>> into the
>> right shape.
>>
>> The question then is where would this new functionality belong? The
>> future
>> of the GenomicFeatures package was a bit uncertain, the last time I
>> checked.
>
> The intention is that GenomicFeatures will mature to contain these
> sorts of data structures and functions, so that it's easy to create or
> retrieve transcript or exon information.
</snip>
I've actually been working on a package of my own that does a lot of
this stuff, since I've been working lately with rna-seq data.
It takes an input file similar to what you get from the ucsc table
browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
an annotation package for the genome/annotation source you are using
(info is stored in an SQLite database). You can then query the
database for genes of interest by name, chromosome, or bounded (start/
end) chromosome position for further downstream use.
Usually results are returned to the user as lists of gene objects,
which the user would have to whip into a data.frame format if that's
what's desired.
I'll post some code below to show you what I mean, and see if you all
think it's helpful. I'm not sure that it's ready for public
consumption, but I guess I just wanted to mention that it was my
intention to develop this a bit further in the near future. If it's
useful for other people, all the better.
Unfortunately I need to put much of this on hold for the immediate
future, since I have to prepare for my upcoming qualifying exam ...
I realize it somehow does similar things to what the GenomicFeatures
package might have done, but I needed this exact functionality for my
immediate analysis needs, and thought I'd have a go at it. The SQLite
database that's generated actually stores gene, transcript, and exon
info in RTree indices[1], so ranged queries (like all genes/
transcripts between some start/stop positions) are super fast. It
requires a custom compile of SQLite, though
Examples of how I use it are pasted below.
-steve
[1] RTree: http://www.sqlite.org/rtree.html
##############################
# Using with refseq annotations for hg18
R> library(GenomeAnnotations) # name of my library
R> hg18.refseq <- GenomeDB("hg18", "refseq")
R> d1r <- Gene("DICER1", genome=hg18.refseq)
## Getting transcript info
R> transcripts(d1r)
SimpleRangesList instance:
$NM_030621
IRanges instance:
start end width
[1] 94622320 94626752 4433
[2] 94627120 94627200 81
[3] 94627296 94627456 161
[4] 94629976 94630248 273
[5] 94631912 94632800 889
[6] 94635872 94636024 153
[7] 94639432 94640216 785
[8] 94641160 94641336 177
[9] 94641768 94641872 105
... ... ... ...
[20] 94660288 94660760 473
[21] 94662672 94662840 169
[22] 94665560 94665720 161
[23] 94666144 94666280 137
[24] 94667600 94667728 129
[25] 94668608 94668768 161
[26] 94669408 94669592 185
[27] 94676752 94676848 97
[28] 94677776 94677808 33
<1 additional element>
## UTR Info
R> getUtr3(d1r)
$NM_030621
IRanges instance:
start end width
[1] 94622320 94626587 4268
$NM_177438
IRanges instance:
start end width
[1] 94622320 94626587 4268
R> getUtr5(d1r)
$NM_030621
IRanges instance:
start end width
[1] 94669548 94669592 45
[2] 94676752 94676848 97
[3] 94677776 94677808 33
$NM_177438
IRanges instance:
start end width
[1] 94669548 94669592 45
[2] 94693320 94693512 193
## Get genes on chromosome 1
R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
R> names(genes.1)[1:5]
[1] "WASH5P" "LOC100288778" "FAM138F" "FAM138A"
"FAM138C"
## Get genes on chr1 between pos 1e6 and 1e7
R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,
end=1e7)
R> names(genes.1.sub)[1:5]
[1] "C1orf159" "TTLL10" "TTLL10" "TNFRSF18" "TNFRSF18"
#########################################
# or, hg18 aceview annos
R> hg18.ace <- GenomeDB("hg18", "aceview")
R> d1a <- Gene("DICER1", genome=hg18.ace)
R> transcripts(d1a)
SimpleRangesList instance:
$DICER1.cApr07
IRanges instance:
start end width
[1] 94622320 94626752 4433
[2] 94627120 94627200 81
[3] 94627296 94627456 161
[4] 94629976 94630248 273
[5] 94631912 94632800 889
[6] 94635872 94636024 153
[7] 94639432 94640216 785
[8] 94641160 94641336 177
[9] 94641768 94641872 105
... ... ... ...
[19] 94653712 94653840 129
[20] 94660288 94660760 473
[21] 94662672 94662840 169
[22] 94665560 94665720 161
[23] 94666144 94666280 137
[24] 94667600 94667728 129
[25] 94668608 94668768 161
[26] 94669408 94669592 185
[27] 94693320 94693512 193
<10 additional elements>
R> getUtr3(d1a)
$DICER1.cApr07
IRanges instance:
start end width
[1] 94622320 94626587 4268
$DICER1.bApr07
IRanges instance:
start end width
[1] 94622320 94626587 4268
.... etc ...
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioc-sig-sequencing
mailing list