[Bioc-sig-seq] Pruning rows from a RangedData object

Fri Jul 17 15:39:23 CEST 2009

Hi Patrick et al.,

Do you have any suggestions on the following?

I've got a RangedData object 'exons0' that I created from a GFF file 
for the human genome. This GFF file contains all transcripts, so an exon 
that appears in multiple transcripts is listed multiple times. I would 
like to filter out these duplicates.

I was pleased to see that you defined 'duplicated' for IRanges, which 
allowed me to find the rows in the RangedData object that are 
duplicates. But how do I prune them?

My first try was this:

   dupeRows <- unlist( sapply( exons0, function(a)
      duplicated(ranges(a)[[1]]) ) )
   exons1 <- exons0[ dupeRows, ]

This seems to do the job:

   > exons0
   RangedData: 507249 ranges by 5 columns on 25 sequences
   colnames(5): type source phase strand group
   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
   chrX chrY

   > dupeRows <- unlist( sapply( exons0, function(a)
   +    duplicated(ranges(a)[[1]]) ) )
   > exons1 <- exons0[ dupeRows, ]

   > exons1
   RangedData: 253143 ranges by 5 columns on 25 sequences
   colnames(5): type source phase strand group
   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
   chrX chrY

However, the resulting object behaves oddly. Compare:

   > exons0["chr01"]
   RangedData: 61840 ranges by 5 columns on 1 sequence
   colnames(5): type source phase strand group
   names(1): chr01

   > exons1["chr01"]
   Error in values[i] : mismatching names (and NULL elements not allowed)

What's going on here?
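
One detail that can be checked independently of the error above: 
'duplicated' returns TRUE for the second and later occurrences, so 
indexing with the unnegated vector keeps the repeats rather than 
removing them. A minimal base-R sketch on a plain integer vector (the 
vector 'x' is made up for illustration and has nothing to do with the 
IRanges classes):

```r
# Hypothetical toy data: start positions with repeats
x <- c(100L, 200L, 100L, 300L, 200L)

# TRUE marks the second and later occurrences of a value
dupe <- duplicated(x)

kept_dupes  <- x[dupe]    # keeps only the repeated entries
kept_unique <- x[!dupe]   # keeps the first occurrence of each value
```

Whether this has any bearing on the 'mismatching names' error is not 
clear from the output alone.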

I've now used the following command instead, which does the job but 
looks quite unwieldy and is very slow:

   exons <-
   do.call( c, unname( lapply( exons0, function(a)
      a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
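
For comparison, this is roughly what I would hope to be able to write. 
On a plain data.frame the same pruning needs no per-sequence loop, 
because 'duplicated' compares whole rows once the sequence name is 
part of the key. This is only a base-R sketch: the table 'exons_df' and 
its columns are hypothetical stand-ins, and I am not assuming that 
RangedData supports an equivalent one-step subset.

```r
# Hypothetical toy table standing in for the exon annotation
exons_df <- data.frame(
  chr   = c("chr01", "chr01", "chr01", "chr02", "chr02"),
  start = c(100L, 100L, 500L, 100L, 100L),
  end   = c(200L, 200L, 600L, 200L, 200L),
  stringsAsFactors = FALSE
)

# Row-wise comparison on (chr, start, end): identical coordinates on
# different chromosomes are not treated as duplicates
is_dupe <- duplicated(exons_df[, c("chr", "start", "end")])

# Negate the flag to keep the first occurrence of each exon
exons_unique <- exons_df[!is_dupe, ]
```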

In case you want to try this yourself, you can find the 'exons0' object 
here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda

Cheers
   Simon