[Bioc-sig-seq] uniqueFilter in the ShortRead package

Sun Feb 7 14:56:18 CET 2010

Hi Jason --

On 02/05/2010 11:30 AM, Jason Lu wrote:
> Hi,
> 
> I have been using the ShortRead package with my sequencing data. It has been
> making my life a lot easier.
> 
> One thing I noticed is that the logic in the uniqueFilter function seems
> to be problematic.
> 
> The original function is:
> function (withSread = TRUE, .name = "UniqueFilter")
> {
>     .check_type_and_length(withSread, "logical", 1)
>     srFilter(function(x) {
>         if (withSread)
>             !srduplicated(x)
>         else {
>             !(duplicated(position(x)) & duplicated(strand(x)) &
>                 duplicated(chromosome(x)))
>         }
>     }, name = .name)
> }
> 
> If withSread = FALSE, the else part seems to filter out lots of reads I
> would like to keep.
> 
> My dumb solution is to make this change:
> !(duplicated(paste(position(x), strand(x), chromosome(x), sep=";")))
> 
> I may have misused the function though.

Technically the function works as documented, but you're right that this
is not usually what one wants. I've implemented occurrenceFilter() in
the development version of ShortRead (look for version >= 1.5.14), which
does what you want when used with withSread=FALSE. It is also meant to
allow more flexible filtering, e.g., keeping reads represented >= min
and <= max times, and to treat sets of duplicate reads differently
(e.g., ignoring all duplicates, rather than keeping the first).
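
To make the difference concrete, here is a small base-R sketch. It does not
use ShortRead at all: the data frame `reads`, its values, and the names
`keep_and`, `keep_key`, `n_occ`, `min_occ` and `max_occ` are made up for
illustration, and the last part only mimics the idea of occurrence-based
filtering rather than the actual occurrenceFilter() implementation.

```r
# Toy alignments (hypothetical values, not taken from the original post).
reads <- data.frame(
    chromosome = c("chr1", "chr2", "chr1", "chr1"),
    strand     = c("+",    "-",    "-",    "+"),
    position   = c(100L,   200L,   100L,   100L),
    stringsAsFactors = FALSE
)

# Logic of the else-branch quoted above: a read is dropped as soon as its
# position, strand and chromosome have EACH been seen before, even if they
# were seen on different reads.
keep_and <- with(reads,
    !(duplicated(position) & duplicated(strand) & duplicated(chromosome)))

# Workaround from the post: test for duplication on the combined key, so a
# read is dropped only when the whole (position, strand, chromosome) triple
# has been seen before.
key <- with(reads, paste(position, strand, chromosome, sep = ";"))
keep_key <- !duplicated(key)

# Read 3 (chr1, "-", 100) is a unique triple; only keep_key retains it.
print(data.frame(reads, keep_and, keep_key))

# Occurrence-based filtering in the same spirit: count how often each key
# occurs, then keep reads whose count lies in [min_occ, max_occ].
n_occ <- as.vector(table(key)[key])
min_occ <- 1L
max_occ <- 1L
keep_singletons <- n_occ >= min_occ & n_occ <= max_occ
```

With `min_occ` and `max_occ` both set to 1, every member of a duplicated
set is dropped, whereas `keep_key` retains the first read of each set.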

I deprecated uniqueFilter in the development branch, which means that it
still works but will be removed in a future release.
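
For anyone unsure what "deprecated" means in practice, the following
base-R sketch shows the general pattern. `uniqueFilterSketch` is a
hypothetical stand-in written for this example, not the ShortRead code,
and that ShortRead signals the deprecation through `.Deprecated()` is an
assumption.

```r
# Hypothetical stand-in for a deprecated function: it still returns its
# result, but signals a deprecation warning naming the replacement.
uniqueFilterSketch <- function(x) {
    .Deprecated("occurrenceFilter")
    !duplicated(x)
}

# Capture the warning message instead of letting it print.
msg <- tryCatch(uniqueFilterSketch(c("a", "b", "a")),
                warning = function(w) conditionMessage(w))

# The function itself still works when the warning is suppressed.
res <- suppressWarnings(uniqueFilterSketch(c("a", "b", "a")))
```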

Thanks for the report.

Martin

>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-redhat-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ShortRead_1.4.0   lattice_0.17-26   BSgenome_1.14.2   Biostrings_2.14.8
> [5] IRanges_1.4.9
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.1 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>
> 
> Thanks,
> Jason

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793