[Bioc-sig-seq] Pruning rows from a RangedData object
Patrick Aboyoun
paboyoun at fhcrc.org
Sat Jul 18 18:09:51 CEST 2009
Simon,
I found a faster approach to obtaining the duplicate rows by ranges
within space of RangedData objects. Rather than loop over the RangedData
object x, it is faster to loop over the RangesList ranges(x).
> suppressMessages(library(IRanges))
> load("exons0.rda")
> system.time({
+ dupeRows1 <- unlist(sapply(exons0, function(a)
+ duplicated(ranges(a)[[1]])))
+ exons1 <- exons0[dupeRows1, ]
+ })
user system elapsed
12.848 2.368 15.327
> system.time({
+ dupeRows2 <- unlist(lapply(ranges(exons0), duplicated))
+ exons2 <- exons0[dupeRows2, ]
+ })
user system elapsed
3.768 0.616 4.502
> identical(exons1, exons2)
[1] TRUE
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.39
Patrick
Patrick Aboyoun wrote:
> Simon,
> You found a bug. I have fixed it in BioC 2.5 (devel) and it will be
> available from bioconductor.org in 24 hours, but you can get it from
> svn now.
>
> I think the rdapply function was intended for performing filtering on
> RangedData objects, but I haven't used it too much.
>
>
> > suppressMessages(library(IRanges))
> > load("exons0.rda")
> > dupeRows <- unlist(sapply(exons0, function(a)
> + duplicated(ranges(a)[[1]])))
> > exons1 <- exons0[ dupeRows, ]
> > exons1["chr01"]
> RangedData: 34312 ranges by 5 columns on 1 sequence
> colnames(5): type source phase strand group
> names(1): chr01
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
> i386-apple-darwin9.7.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] IRanges_1.3.38
>
>
>
>
> Simon Anders wrote:
>> Hi Patrick et al.
>>
>> do you have any suggestions on the following?
>>
>> I've got an RangesList object 'exons0' that I created from a GFF file
>> for the human genome. This GFF file contains all transcripts, and
>> lists all those exons that appear in multiple transcripts multiple
>> times. I would like to filter them out.
>>
>> I was pleased to see that you redefined 'duplicated' for RangedData,
>> which allowed me to find the rows in the IRanges object that are
>> duplicates. But how do I prune them?
>>
>> My first try was this here:
>>
>> dupeRows <- unlist( sapply( exons0, function(a)
>> duplicated(ranges(a)[[1]]) ) )
>> exons1 <- exons0[ dupeRows, ]
>>
>> This seem to do the job:
>>
>> > exons0
>> RangedData: 507249 ranges by 5 columns on 25 sequences
>> colnames(5): type source phase strand group
>> names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>> chrX chrY
>>
>> > dupeRows <- unlist( sapply( exons0, function(a)
>> + duplicated(ranges(a)[[1]]) ) )
>> > exons1 <- exons0[ dupeRows, ]
>>
>> > exons1
>> RangedData: 253143 ranges by 5 columns on 25 sequences
>> colnames(5): type source phase strand group
>> names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>> chrX chrY
>>
>> However, the resulting object behaves oddly. Compare:
>>
>> > exons0["chr01"]
>> RangedData: 61840 ranges by 5 columns on 1 sequence
>> colnames(5): type source phase strand group
>> names(1): chr01
>>
>> > exons1["chr01"]
>> Error in values[i] : mismatching names (and NULL elements not allowed)
>>
>> What's going on here?
>>
>>
>> I've now used this command here instead, which does the job, but
>> looks quite unwieldy and is very slow:
>>
>> exons <-
>> do.call( c, unname( lapply( exons0, function(a)
>> a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
>>
>>
>> In case you want to try this yourself, you can find the 'exon0'
>> object here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
>>
>>
>> Cheers
>> Simon
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list