[Bioc-sig-seq] Pruning rows from a RangedData object

Fri Jul 17 20:09:17 CEST 2009

Simon,
You found a bug. I have fixed it in BioC 2.5 (devel) and it will be 
available from bioconductor.org in 24 hours, but you can get it from svn 
now.

I think the rdapply function was intended for performing filtering on 
RangedData objects, but I haven't used it too much.

 > suppressMessages(library(IRanges))
 > load("exons0.rda")
 > dupeRows <- unlist(sapply(exons0, function(a)
+   duplicated(ranges(a)[[1]])))
 > exons1 <- exons0[ dupeRows, ]
 > exons1["chr01"]
RangedData: 34312 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] IRanges_1.3.38

Simon Anders wrote:
> Hi Patrick et al.
>
> do you have any suggestions on the following?
>
> I've got an RangesList object 'exons0' that I created from a GFF file 
> for the human genome. This GFF file contains all transcripts, and 
> lists all those exons that appear in multiple transcripts multiple 
> times. I would like to filter them out.
>
> I was pleased to see that you redefined 'duplicated' for RangedData, 
> which allowed me to find the rows in the IRanges object that are 
> duplicates. But how do I prune them?
>
> My first try was this here:
>
>   dupeRows <- unlist( sapply( exons0, function(a)
>      duplicated(ranges(a)[[1]]) ) )
>   exons1 <- exons0[ dupeRows, ]
>
> This seem to do the job:
>
>   > exons0
>   RangedData: 507249 ranges by 5 columns on 25 sequences
>   colnames(5): type source phase strand group
>   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>   chrX chrY
>
>   > dupeRows <- unlist( sapply( exons0, function(a)
>   +    duplicated(ranges(a)[[1]]) ) )
>   > exons1 <- exons0[ dupeRows, ]
>
>   > exons1
>   RangedData: 253143 ranges by 5 columns on 25 sequences
>   colnames(5): type source phase strand group
>   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>   chrX chrY
>
> However, the resulting object behaves oddly. Compare:
>
>   > exons0["chr01"]
>   RangedData: 61840 ranges by 5 columns on 1 sequence
>   colnames(5): type source phase strand group
>   names(1): chr01
>
>   > exons1["chr01"]
>   Error in values[i] : mismatching names (and NULL elements not allowed)
>
> What's going on here?
>
>
> I've now used this command here instead, which does the job, but looks 
> quite unwieldy and is very slow:
>
>   exons <-
>   do.call( c, unname( lapply( exons0, function(a)
>      a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
>
>
> In case you want to try this yourself, you can find the 'exon0' object 
> here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
>
>
> Cheers
>   Simon
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing