[Bioc-sig-seq] uniqueFilter in the ShortRead package - Integer overflow - bug fix

Martin Morgan mtmorgan at fhcrc.org
Mon Feb 8 19:18:38 CET 2010


On 02/08/2010 04:54 AM, Nora Rieber wrote:
> Dear Martin,
> 
> I was thrilled to discover the occurrenceFilter() function and I tried
> it out right away on my data.
> However I got this error & warning:
> 
>> aln_B[occurrenceFilter(withSread=FALSE, duplicates="head")(aln_B)]
> Error in if (sum(q) != 0L) { : Missing value where TRUE/FALSE needed
> Warning:
> In sum(q) : integer overflow - use sum(as.numeric(,))
> 
> 
> I've had a look and then modified the function code and it seems that using
> 
> if (sum(as.numeric(q)) != 0L)
> 
> instead of
> 
> if (sum(q) != 0L) {
> 
> does fix the problem. However I'm not sure whether this might modify the
> functionality in any unwanted way!

Hi Nora --

Glad that occurrenceFilter generated enthusiasm, but sorry for the
problems. The test should be length(q) != 0L rather than sum(q) != 0L;
this is fixed in ShortRead v. 1.5.15. Thanks for the report.
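
For anyone wondering why the original test failed, here is a minimal
sketch in base R. It assumes only that q is an integer vector with a
large total; the vectors below are made up for illustration and are not
the object used inside occurrenceFilter(). It shows that sum() on
integers gives NA once the total passes .Machine$integer.max, that if()
cannot branch on NA, and that the two candidate fixes ask different
questions.

```r
# Hypothetical integer vector whose total exceeds .Machine$integer.max
q <- c(.Machine$integer.max, 1L)

# Integer sum() overflows to NA (normally with a warning, silenced here)
s <- suppressWarnings(sum(q))
overflowed <- is.na(s)

# if (NA != 0L) cannot be evaluated, so the condition raises an error
res <- tryCatch(
    if (s != 0L) "non-zero" else "zero",
    error = function(e) "error"
)

# Summing as doubles avoids the overflow but still adds the values up
sum_test <- sum(as.numeric(q)) != 0L

# Testing the length only asks whether q has any elements
len_test <- length(q) != 0L

# The two tests are not equivalent in general: a vector of zeros
z <- c(0L, 0L)
z_sum_test <- sum(as.numeric(z)) != 0L
z_len_test <- length(z) != 0L
```

Both tests agree for q, but for z the sum-based test is FALSE while the
length-based test is TRUE, so which one is right depends on what the
code means to ask.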

Martin

> 
> Best wishes,
> Nora
> 
>> Date: Sun, 07 Feb 2010 05:56:18 -0800
>> From: Martin Morgan <mtmorgan at fhcrc.org>
>> To: Jason Lu <jasonlu68 at gmail.com>
>> Cc: bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] uniqueFilter in the ShortRead package
>> Message-ID: <4B6EC682.4000401 at fhcrc.org>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Jason --
>>
>> On 02/05/2010 11:30 AM, Jason Lu wrote:
>>   
>>> Hi,
>>>
>>> I have been using the ShortRead package with my sequencing data. It has been
>>> making my life a lot easier.
>>>
>>> One thing I noticed is that the logic in the uniqueFilter function seems
>>> to be problematic.
>>>
>>> The original function is:
>>> function (withSread = TRUE, .name = "UniqueFilter")
>>> {
>>>     .check_type_and_length(withSread, "logical", 1)
>>>     srFilter(function(x) {
>>>         if (withSread)
>>>             !srduplicated(x)
>>>         else {
>>>             !(duplicated(position(x)) & duplicated(strand(x)) &
>>>                 duplicated(chromosome(x)))
>>>         }
>>>     }, name = .name)
>>> }
>>>
>>> If withSread = FALSE, the else part seems to filter out lots of reads I
>>> would like to keep.
>>>
>>> My dumb solution is to have this change:
>>> !(duplicated(paste(position(x), strand(x), chromosome(x),sep=";")))
>>>
>>> I may have misused the function though.
>>>     
>>
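
The difference Jason describes can be reproduced in base R without
ShortRead. This is a sketch on made-up data: the vectors position,
strand and chromosome below are hypothetical stand-ins for the values
returned by the accessors of the same names. It compares the field-wise
test from the original function with the joint key from Jason's change.

```r
# Hypothetical alignments; read 3 is a new (position, strand, chromosome)
# combination, read 4 is an exact repeat of read 1
position   <- c(100L,   200L,   100L,   100L)
strand     <- c("+",    "-",    "-",    "+")
chromosome <- c("chr1", "chr2", "chr1", "chr1")

# Field-wise test: a read is dropped when each field, taken separately,
# has been seen before
fieldwise_keep <- !(duplicated(position) & duplicated(strand) &
    duplicated(chromosome))

# Joint test: a read is dropped only when the whole combination
# has been seen before
joint_keep <- !duplicated(paste(position, strand, chromosome, sep = ";"))
```

Here fieldwise_keep drops read 3 as well as read 4, because position
100, strand "-" and chromosome "chr1" have each appeared earlier, though
never together. joint_keep drops only read 4.
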
>> Technically the function works as documented but you're right that this
>> is not usually what one wants. I've implemented occurrenceFilter() in
>> the development version of ShortRead (look for version >= 1.5.14) which
>> does what you want using withSread=FALSE; it is also meant to do more
>> flexible filtering, e.g., reads represented >=min and <= max times, and
>> to treat sets of duplicate reads differently (e.g., ignoring all
>> duplicates, rather than keeping the first).
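
The kinds of filtering mentioned above (occurrence counts within a
range, and different treatment of duplicate sets) can be sketched in
base R as follows. This is not the occurrenceFilter() implementation;
the key vector and the names min_n and max_n are hypothetical and only
illustrate the idea.

```r
# Hypothetical per-read keys, e.g. a pasted position/strand/chromosome string
key <- c("a", "b", "a", "c", "a", "b")

# Number of times each read's key occurs in the whole set
n <- as.integer(table(key)[key])

# Keep the first read of every set of duplicates
keep_first <- !duplicated(key)

# Ignore all duplicates: keep only reads whose key occurs exactly once
keep_singletons <- n == 1L

# Keep the first read of each set whose size lies between min_n and max_n
min_n <- 2L
max_n <- 3L
keep_in_range <- n >= min_n & n <= max_n & !duplicated(key)
```

Counting occurrences once and combining the count with duplicated()
keeps each variant to a single logical expression.
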
>>
>> I deprecated uniqueFilter in the development branch, which means that it
>> still works but will be removed in a future release.
>>
>> Thanks for the report.
>>
>> Martin
>>
>>   
>>>> sessionInfo()
>>>>       
>>> R version 2.10.0 (2009-10-26)
>>> x86_64-redhat-linux-gnu
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.4.0   lattice_0.17-26   BSgenome_1.14.2   Biostrings_2.14.8
>>> [5] IRanges_1.4.9
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>>     
>>> Thanks,
>>> Jason
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>     
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>
>>   
>>
>>



