[Bioc-sig-seq] uniqueFilter in the ShortRead package
Martin Morgan
mtmorgan at fhcrc.org
Sun Feb 7 14:56:18 CET 2010
Hi Jason --
On 02/05/2010 11:30 AM, Jason Lu wrote:
> Hi,
>
> I have been using the ShortRead package with my sequencing data. It has been
> making my life a lot easier.
>
> One thing I noticed is that the logic in the uniqueFilter function seems to be
> problematic.
>
> The original function is:
> function (withSread = TRUE, .name = "UniqueFilter")
> {
>     .check_type_and_length(withSread, "logical", 1)
>     srFilter(function(x) {
>         if (withSread)
>             !srduplicated(x)
>         else {
>             !(duplicated(position(x)) & duplicated(strand(x)) &
>                 duplicated(chromosome(x)))
>         }
>     }, name = .name)
> }
>
> If withSread = FALSE, the else part seems to filter out lots of reads I
> would like to keep.
>
> My dumb solution is to have this change:
> !(duplicated(paste(position(x), strand(x), chromosome(x),sep=";")))
>
> I may have misused the function though.
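To see why the two expressions above differ, here is a small base R sketch. The three vectors are made-up stand-ins for what position(x), strand(x) and chromosome(x) might return; they are not real data and no ShortRead functions are called.

```r
# Hypothetical coordinates for four aligned reads (illustration only).
chromosome <- c("chr1", "chr2", "chr1", "chr1")
position   <- c(100L, 200L, 200L, 100L)
strand     <- c("+", "-", "+", "+")

# Component-wise test, as in the else branch of uniqueFilter:
# a read is flagged when each field, taken separately, was seen earlier.
componentwise_dup <- duplicated(position) & duplicated(strand) &
    duplicated(chromosome)

# Joint test, as in the proposed change:
# a read is flagged only when the whole combination was seen earlier.
key <- paste(position, strand, chromosome, sep = ";")
joint_dup <- duplicated(key)

print(componentwise_dup)
print(joint_dup)
```

Read 3 (chr1, 200, "+") is a new combination, but its position, strand and chromosome have each appeared in some earlier read, so the component-wise test flags it and the filter drops it. Only read 4 repeats a full combination.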
Technically the function works as documented but you're right that this
is not usually what one wants. I've implemented occurrenceFilter() in
the development version of ShortRead (look for version >= 1.5.14) which
does what you want using withSread=FALSE; it is also meant to do more
flexible filtering, e.g., reads represented >= min and <= max times, and
to treat sets of duplicate reads differently (e.g., ignoring all
duplicates, rather than keeping the first).
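As a rough illustration of the kinds of filtering described above, here is a base R sketch that works on a plain character vector of keys. It is not the occurrenceFilter() implementation and does not use its arguments; the key values and variable names are made up for the example.

```r
# Hypothetical keys, e.g. built with paste(position, strand, chromosome).
key <- c("a", "b", "a", "c", "a", "b")

# Count how often each key occurs, aligned to the input order.
idx <- match(key, key)                        # index of first occurrence
n <- tabulate(idx, nbins = length(key))[idx]  # occurrences per element

# Keep the first member of each set of duplicates.
keep_first <- !duplicated(key)

# Ignore all duplicates: keep only keys that occur exactly once.
keep_singletons <- n == 1L

# Keep elements whose key occurs between min and max times (inclusive).
min_n <- 1L
max_n <- 2L
keep_in_range <- n >= min_n & n <= max_n

print(key[keep_first])
print(key[keep_singletons])
print(key[keep_in_range])
```

Counting occurrences once and deriving each logical vector from the counts keeps the three policies easy to compare side by side.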
I deprecated uniqueFilter in the development branch, which means that it
still works but will be removed in a future release.
Thanks for the report.
Martin
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-redhat-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.2 Biostrings_2.14.8
> [5] IRanges_1.4.9
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.1 grid_2.10.0 hwriter_1.1 tools_2.10.0
>>
>
> Thanks,
> Jason
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793