[Bioc-sig-seq] GenomicFeatures package

Thu Oct 29 16:28:01 CET 2009

Hi,

On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
<snip>
>>>
>> It could be useful, but you'll need to get it into the shape  
>> expected by the
>> GenomicFeatures functions. With specific regard to transcripts(),  
>> it will
>> need a 'chrom' column with the chromosome names, 'name' column for  
>> the gene
>> name, and 'txStart' and 'txEnd' for the start and end of the  
>> transcripts,
>> using UCSC coordinate conventions.
>>
>> I'm now thinking that it would be more convenient for the user if  
>> there was
>> a transcripts method on a RangesList object, which could provide the
>> necessary information. This could be extracted from the RangedData,  
>> which
>> rtracklayer can create from your GFF file, with one caveat: if the  
>> GFF file
>> contains a mixture of features (genes, exons, etc) and relies on the
>> hierarchical features of GFF, it will take more work to get things  
>> into the
>> right shape.
>>
>> The question then is where would this new functionality belong? The  
>> future
>> of the GenomicFeatures package was a bit uncertain, the last time I  
>> checked.
>
> The intention is that GenomicFeatures will mature to contain these
> sorts of data structures and functions, so that it's easy to create or
> retrieve transcript or exon information.
</snip>

I've actually been working on a package of my own that does a lot of  
this stuff, since I've been working lately with rna-seq data.

It takes an input file similar to what you get from the ucsc table  
browser with gene, tx/cds -start/-end, exon, etc. columns and whips up  
an annotation package for the genome/annotation source you are using  
(info is stored in an SQLite database). You can then query the  
database for genes of interest by name, chromosome, or bounded (start/ 
end) chromosome position for further downstream use.

Usually results are returned to the user as lists of gene objects,  
which the user would have to whip into a data.frame format if that's  
what's desired.

I'll post some code below to show you what I mean, and see if you all  
think it's helpful. I'm not sure that it's ready for public  
consumption, but I guess I just wanted to mention that it was my  
intention to develop this a bit further in the near future. If it's  
useful for other people, all the better.

Unfortunately I need to put much of this on hold for the immediate  
future, since I have to prepare for my upcoming qualifying exam ...

I realize it somehow does similar things to what the GenomicFeatures  
package might have done, but I needed this exact functionality for my  
immediate analysis needs, and thought I'd have a go at it. The SQLite  
database that's generated actually stores gene, transcript, and exon  
info in RTree indices[1], so ranged queries (like all genes/ 
transcripts between some start/stop positions) are super fast. It  
requires a custom compile of SQLite, though

Examples of how I use it are pasted below.

-steve

[1] RTree: http://www.sqlite.org/rtree.html

##############################
# Using with refseq annotations for hg18

R> library(GenomeAnnotations) # name of my library
R> hg18.refseq <- GenomeDB("hg18", "refseq")
R> d1r <- Gene("DICER1", genome=hg18.refseq)

## Getting transcript info

R> transcripts(d1r)
SimpleRangesList instance:
$NM_030621
IRanges instance:
         start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94676752 94676848    97
[28] 94677776 94677808    33

<1 additional element>

## UTR Info
R> getUtr3(d1r)
$NM_030621
IRanges instance:
        start      end width
[1] 94622320 94626587  4268

$NM_177438
IRanges instance:
        start      end width
[1] 94622320 94626587  4268

R> getUtr5(d1r)
$NM_030621
IRanges instance:
        start      end width
[1] 94669548 94669592    45
[2] 94676752 94676848    97
[3] 94677776 94677808    33

$NM_177438
IRanges instance:
        start      end width
[1] 94669548 94669592    45
[2] 94693320 94693512   193

## Get genes on chromosome 1
R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
R> names(genes.1)[1:5]
[1] "WASH5P"       "LOC100288778" "FAM138F"      "FAM138A"       
"FAM138C"

## Get genes on chr1 between pos 1e6 and 1e7
R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,  
end=1e7)
R> names(genes.1.sub)[1:5]
[1] "C1orf159" "TTLL10"   "TTLL10"   "TNFRSF18" "TNFRSF18"

#########################################
# or, hg18 aceview annos
R> hg18.ace <- GenomeDB("hg18", "aceview")
R> d1a <- Gene("DICER1", genome=hg18.ace)
R> transcripts(d1a)
SimpleRangesList instance:
$DICER1.cApr07
IRanges instance:
         start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[19] 94653712 94653840   129
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94693320 94693512   193

<10 additional elements>

R> getUtr3(d1a)
$DICER1.cApr07
IRanges instance:
        start      end width
[1] 94622320 94626587  4268

$DICER1.bApr07
IRanges instance:
        start      end width
[1] 94622320 94626587  4268

.... etc ...

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact