[Bioc-sig-seq] Pruning rows from a RangedData object
Simon Anders
anders at ebi.ac.uk
Fri Jul 17 15:39:23 CEST 2009
Hi Patrick et al.,
do you have any suggestions on the following?
I've got a RangedData object 'exons0' that I created from a GFF file
for the human genome. This GFF file contains all transcripts, so every
exon that appears in multiple transcripts is listed multiple times. I
would like to filter out these duplicates.
I was pleased to see that you redefined 'duplicated' for RangedData,
which allowed me to find the rows in the IRanges object that are
duplicates. But how do I prune them?
My first try was this:
dupeRows <- unlist( sapply( exons0, function(a)
   duplicated(ranges(a)[[1]]) ) )
exons1 <- exons0[ dupeRows, ]
This seems to do the job:
> exons0
RangedData: 507249 ranges by 5 columns on 25 sequences
colnames(5): type source phase strand group
names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
chrX chrY
> dupeRows <- unlist( sapply( exons0, function(a)
+ duplicated(ranges(a)[[1]]) ) )
> exons1 <- exons0[ dupeRows, ]
> exons1
RangedData: 253143 ranges by 5 columns on 25 sequences
colnames(5): type source phase strand group
names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
chrX chrY
However, the resulting object behaves oddly. Compare:
> exons0["chr01"]
RangedData: 61840 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
> exons1["chr01"]
Error in values[i] : mismatching names (and NULL elements not allowed)
What's going on here?
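Separately from the error, one thing that may be worth ruling out: a logical vector returned by 'duplicated' is TRUE for the repeated entries, so using it directly as a row index keeps the repeats, while negating it keeps the first occurrence of each range. A minimal base-R sketch, assuming a small made-up data frame as a stand-in for the ranges on one chromosome (the 'toy' table and its values are illustrative and not taken from 'exons0'):

```r
# Toy stand-in for the exons on one chromosome (hypothetical values).
toy <- data.frame(start = c(100, 100, 250, 400, 400),
                  end   = c(180, 180, 300, 520, 520))

# duplicated() flags the second and later copies of each row.
dupe <- duplicated(toy)

repeats     <- toy[ dupe, ]   # only the repeated copies
unique_rows <- toy[ !dupe, ]  # first occurrence of each range
```

Whether this has anything to do with the "mismatching names" error is not clear from the sketch; it only illustrates the indexing semantics.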
I'm now using the following command instead, which does the job but
looks quite unwieldy and is very slow:
exons <-
   do.call( c, unname( lapply( exons0, function(a)
      a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
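For comparison, the same per-chromosome logic can be sketched in base R on a flat table, where adding the chromosome to the duplicate key replaces the loop with a single call. This is only an illustration under the assumption that the coordinates have been copied into an ordinary data frame; 'coords' and its columns are hypothetical, and nothing here uses the RangedData interface:

```r
# Hypothetical flat table: one row per exon, chromosome as a column.
coords <- data.frame(
  chrom = c("chr01", "chr01", "chr01", "chr02", "chr02"),
  start = c(100, 100, 250, 100, 100),
  end   = c(180, 180, 300, 180, 180),
  stringsAsFactors = FALSE
)

# Per-chromosome loop, mirroring the lapply() form above: duplicates
# are detected within each chromosome separately. The result is in
# split() order, which equals row order here because the rows are
# already sorted by chromosome.
keep_loop <- unlist( lapply( split(coords, coords$chrom), function(d)
   !duplicated( d[ c("start", "end") ] ) ), use.names = FALSE )

# Single call: with 'chrom' in the key, identical coordinates on
# different chromosomes are not treated as duplicates of each other.
keep_vec <- !duplicated( coords[ c("chrom", "start", "end") ] )

deduped <- coords[ keep_vec, ]
```

Both forms select the same rows on this toy input; the single call avoids building and concatenating one object per chromosome.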
In case you want to try this yourself, you can find the 'exons0' object
here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
Cheers
Simon