[BioC] GenomicRanges::nearest (especially follow)
Harris A. Jaffee
hj at jhu.edu
Tue Oct 16 18:25:28 CEST 2012
Fantastic, thank you. I'm not sure I knew about that page!
For example, ?precede gives me the IRanges nearest-methods page,
and only if I explicitly request ?'nearest-methods' can I choose
the GenomicRanges page. That brings up another issue: ?distance
again sends me to the IRanges nearest-methods help, and that only
documents distanceToNearest, which is not enough. The spec for
distance actually lives only on
?'precede,GenomicRanges,GenomicRanges-method', as far as I see.
Finally, and admittedly a total edge case, but zero-width ranges
appear to be mishandled, or not handled, by the IRanges distance
code. I think they should be infinitely far from anything, or at
least NA, and there should not be any nearest or distanceToNearest
value for a zero-width query.
> x
IRanges of length 1
start end width
[1] 2 1 0
> distance(x, x)
[1] 1
> y
IRanges of length 1
start end width
[1] 3 2 0
> distance(x, y)
[1] 2
> z
IRanges of length 1
start end width
[1] 3 3 1
> distance(x, z)
[1] 2
On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote:
> A few follow up items from an off-list conversation with Harris.
>
> 1) I reversed the use of 5' and3' in my response below. Sorry about that.
>
> 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges.
>
> 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases,
>
> ?'nearest-methods'
> ?'precede,GenomicRanges,GenomicRanges-method'
> ?'follow,GenomicRanges,GenomicRanges-method'
>
> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4.
>
> Thanks Harris.
>
> Valerie
>
> On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
>> Hi Harris,
>>
>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>>> Apologies in advance if I just don't understand the ignore.strand switch, or
>>> perhaps GRanges objects. Also, I have not even tried to understand
>>>
>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow
>>>
>>> I have the impression that a query with strand entirely "*" essentially implies
>>> ignore.strand, but this case here is handled correctly only if ignore.strand is
>>> TRUE (consistent with distanceToNearest but not distance, which seems correct):
>>>
>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*")
>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-"))
>>>> distance(x, Y[1])
>>> [1] 1
>>>> distance(x, Y[2])
>>> [1] 2
>>>
>>>> nearest(x, Y, ignore.strand=TRUE) # correct
>>> [1] 1
>>>> nearest(ranges(x), ranges(Y)) # also correct
>>> [1] 1
>>>
>>> However,
>>>
>>>> nearest(x, Y)
>>> [1] 2
>>
>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2.
>>
>> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest
>>
>> > 1 < -2
>> [1] FALSE
>>
>> so the second range won.
>>>> distanceToNearest(x, Y)
>>> DataFrame with 1 row and 3 columns
>>> queryHits subjectHits distance
>>> <integer> <integer> <integer>
>>> 1 1 2 2
>>>
>>>> distanceToNearest(x, Y, ignore.strand=TRUE)
>>> DataFrame with 1 row and 3 columns
>>> queryHits subjectHits distance
>>> <integer> <integer> <integer>
>>> 1 1 1 1
>>>
>>>
>>> Finally,
>>>
>>>> follow(ranges(x), ranges(Y))
>>> [1] NA
>>>
>>> The issue (along with GenomicRanges:::.nearest) must come down to this:
>>>
>>>> follow(x, Y) # how can this be?
>>> [1] 2
>>
>> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1]
>> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6.
>>
>> > precede(x, Y[1])
>> [1] 1
>>
>> but does not follow it
>>
>> > follow(x, Y[1])
>> [1] NA
>>
>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7
>>
>> > follow(x, Y[2])
>> [1] 1
>>
>> but does not precede it
>>
>> > precede(x, Y[2])
>> [1] NA
>>
>> When we set ignore.strand=TRUE, all ranges are considered '+'.
>>
>> Valerie
>>> whereas
>>>
>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think
>>> [1] NA
>>>
>>>> sessionInfo()
>>> R Under development (unstable) (2012-10-10 r60908)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices datasets utils methods base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list