[Bioc-sig-seq] GenomicFeatures package
Seth Falcon
sfalcon at fhcrc.org
Mon Nov 2 06:47:08 CET 2009
On 10/31/09 4:14 PM, Michael Lawrence wrote:
> On Sat, Oct 31, 2009 at 2:15 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>
>> Hi Steve --
>>
>> Steve Lianoglou<mailinglist.honeypot at gmail.com> writes:
>>
>>> Hi,
>>>
>>> On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
>>> <snip>
>>>>>>
>>>>> It could be useful, but you'll need to get it into the shape
>>>>> expected by the
>>>>> GenomicFeatures functions. With specific regard to transcripts(),
>>>>> it will
>>>>> need a 'chrom' column with the chromosome names, 'name' column for
>>>>> the gene
>>>>> name, and 'txStart' and 'txEnd' for the start and end of the
>>>>> transcripts,
>>>>> using UCSC coordinate conventions.
>>>>>
>>>>> I'm now thinking that it would be more convenient for the user if
>>>>> there was
>>>>> a transcripts method on a RangesList object, which could provide the
>>>>> necessary information. This could be extracted from the RangedData,
>>>>> which
>>>>> rtracklayer can create from your GFF file, with one caveat: if the
>>>>> GFF file
>>>>> contains a mixture of features (genes, exons, etc) and relies on the
>>>>> hierarchical features of GFF, it will take more work to get things
>>>>> into the
>>>>> right shape.
>>>>>
>>>>> The question then is where would this new functionality belong? The
>>>>> future
>>>>> of the GenomicFeatures package was a bit uncertain, the last time I
>>>>> checked.
>>>>
>>>> The intention is that GenomicFeatures will mature to contain these
>>>> sorts of data structures and functions, so that it's easy to create or
>>>> retrieve transcript or exon information.
>>> </snip>
>>
>>> I've actually been working on a package of my own that does a lot of
>>> this stuff, since I've been working lately with rna-seq data.
>>>
>>> It takes an input file similar to what you get from the ucsc table
>>> browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
>>> an annotation package for the genome/annotation source you are using
>>> (info is stored in an SQLite database). You can then query the
>>> database for genes of interest by name, chromosome, or bounded (start/
>>> end) chromosome position for further downstream use.
>>>
>>> Usually results are returned to the user as lists of gene objects,
>>> which the user would have to whip into a data.frame format if that's
>>> what's desired.
>>>
>>> I'll post some code below to show you what I mean, and see if you all
>>> think it's helpful. I'm not sure that it's ready for public
>>> consumption, but I guess I just wanted to mention that it was my
>>> intention to develop this a bit further in the near future. If it's
>>> useful for other people, all the better.
>>
>> Just wanted to acknowledge this email. I like how straight-forward
>> your interface is, and the use of appropriate classes. RTree and the
>> custom compile is a little problematic;
>
>
> I'm wondering if the custom compile is really necessary? Is it not possible
> for rsqlite to provide that feature by default?
As far as I know, recent versions of RSQLite have the RTree support
enabled. The following, taken from the RTree documentations
(http://www.sqlite.org/rtree.html), seems to work (RSQLite_0.7-3 DBI_0.2-4):
library("RSQLite")
sql = "
CREATE VIRTUAL TABLE demo_index USING rtree(
id, -- Integer primary key
minX, maxX, -- Minimum and maximum X coordinate
minY, maxY -- Minimum and maximum Y coordinate
)
"
db <- dbConnect(SQLite())
dbGetQuery(db, sql)
ins1 <- "
INSERT INTO demo_index VALUES(
1, -- Primary key
-80.7749, -80.7747, -- Longitude range
30.3776, 30.3778 -- Latitude range
)
"
ins2 <- "
INSERT INTO demo_index VALUES(
2,
-81.0, -79.6,
35.0, 36.2
)
"
dbGetQuery(db, ins1)
dbGetQuery(db, ins2)
sqlq <- "SELECT id FROM demo_index WHERE minX>=-81.08 AND maxX<=-80.58
AND minY>=30.00 AND maxY<=35.44"
dbGetQuery(db, sqlq)
In any case, it would be great to continue the conversation.
+ seth
More information about the Bioc-sig-sequencing
mailing list