[BioC] GenomicRanges::nearest (especially follow)
Harris A. Jaffee
hj at jhu.edu
Wed Oct 17 18:22:36 CEST 2012
On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote:
> Indeed, for distance to behave as a metric it must be that distance(x,x)
> == 0 for all x
>
> Here is some further (consistent) oddness:
>
>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3)))
> [1] -5
Thanks for the tip on compare. Looks like a nice classification scheme.
I should start reading vignettes!
> I would expect the value to be 0 here. Comparing x with itself should
> yield 0. After all they are identical!
Yeah, identical but *empty*. Therefore, none of the scenarios depicted
in the man page seems appropriate (even 0/g):
-6 a: x[i]: .oooo....... 6 m: x[i]: .......oooo.
y[i]: .......oooo. y[i]: .oooo.......
-5 b: x[i]: ..oooo...... 5 l: x[i]: ......oooo..
y[i]: ......oooo.. y[i]: ..oooo......
-4 c: x[i]: ...oooo..... 4 k: x[i]: .....oooo...
y[i]: .....oooo... y[i]: ...oooo.....
-3 d: x[i]: ...oooooo... 3 j: x[i]: .....oooo...
y[i]: .....oooo... y[i]: ...oooooo...
-2 e: x[i]: ..oooooooo.. 2 i: x[i]: ....oooo....
y[i]: ....oooo.... y[i]: ..oooooooo..
-1 f: x[i]: ...oooo..... 1 h: x[i]: ...oooooo...
y[i]: ...oooooo... y[i]: ...oooo.....
0 g: x[i]: ...oooooo...
y[i]: ...oooooo...
Again, I like NA in these cases.
> --Malcolm
>
>
> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at gmail.com> wrote:
>
>> Hmm, that's odd. And it seems inconsistent with the usual distance() and
>> distanceToNearest() methods, which would return 0 in this case, yes?
>>
>> e.g. if the IRanges were GRanges and sitting on top of each other on the
>> same strand, or if the first was distance()'ed against itself (the same
>> thing)
>>
>> That doesn't make sense to me either. This doesn't seem like an issue of
>> theology to me, but rather inconsistency with what the docs say will
>> happen.
>>
>>
>>
>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at jhu.edu> wrote:
>>
>>> As I said, any rule is fine, especially if they can serve a purpose
>>> via their "location". But you also have this, which seems weird:
>>>
>>>> distance(IRanges(2,1), IRanges(2,1))
>>> [1] 1
>>>
>>> To say they are not treated any differently is a big stretch to me,
>>> since their defining formula (end = start-1) is already a violation
>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been
>>> more useful.]
>>>
>>> Just so we don't continue off into useless theology, I should say that
>>> I started looking into the distance code with the idea of adding a
>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and
>>> I wanted to understand the landscape first.
>>>
>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote:
>>>
>>>> Hi Harris, Val, Michael,
>>>>
>>>> Right now, all those distances are equal to 8, which I think is what
>>>> we want:
>>>>
>>>> distance(IRanges(1,2), IRanges(10,16))
>>>> distance(IRanges(10,16), IRanges(1,2))
>>>> distance(IRanges(1,2), IRanges(10,9))
>>>> distance(IRanges(10,9),IRanges(1,2))
>>>> distance(IRanges(3,2), IRanges(10,16))
>>>> distance(IRanges(10,16), IRanges(3,2))
>>>> distance(IRanges(3,2), IRanges(10,9))
>>>> distance(IRanges(10,9), IRanges(3,2))
>>>>
>>>> Yes, from a mathematical point of view, the distance between a
>>>> non-empty set and an empty set if not defined, but I'm not sure there
>>>> would be much to gain in doing this. Some use cases would be needed.
>>>> What's nice about the current behavior is that zero-width ranges are
>>>> not treated any differently than other ranges.
>>>>
>>>> H.
>>>>
>>>>
>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote:
>>>>> Any rule is fine with me. They would not occur for us except by
>>> mistake.
>>>>> I was basically raising an alert that the current situation is not
>>> quite
>>>>> complete.
>>>>>
>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote:
>>>>>
>>>>>> Since these zero-width ranges do have a position, I'm not sure why
>>> we
>>> cannot calculate their distance. As long as we have an established rule
>>> about how we handle them. For example, is it just from the start? That's
>>> probably most intuitive.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <
>>> vobencha at fhcrc.org> wrote:
>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote:
>>>>>> Fantastic, thank you. I'm not sure I knew about that page!
>>>>>>
>>>>>> For example, ?precede gives me the IRanges nearest-methods page,
>>>>>> and only if I explicitly request ?'nearest-methods' can I choose
>>>>>> the GenomicRanges page. That brings up another issue: ?distance
>>>>>> again sends me to the IRanges nearest-methods help, and that only
>>>>>> documents distanceToNearest, which is not enough. The spec for
>>>>>> distance actually lives only on
>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see.
>>>>>>
>>>>>> I can update the IRanges man page to include distance().
>>>>>>
>>>>>> Finally, and admittedly a total edge case, but zero-width ranges
>>>>>> appear to be mishandled, or not handled, by the IRanges distance
>>>>>> code. I think they should be infinitely far from anything, or at
>>>>>> least NA, and there should not be any nearest or distanceToNearest
>>>>>> value for a zero-width query.
>>>>>> cc'ing Herve and Michael for history.
>>>>>>
>>>>>> In general I agree that zero-width ranges should not have a distance
>>> from any other range. I favor returning NA instead of inf since the
>>> range
>>> (albeit zero-width) does have a location.
>>>>>>
>>>>>> One concern I have is that we use zero-width ranges to represent
>>> insertions. In that context we may want to compute the distance to the
>>> nearest insertion. I do not have a concrete use case - just thinking of
>>> what might come. I'd be interested in opinions on this.
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>
>>>>>>
>>>>>> x
>>>>>> IRanges of length 1
>>>>>> start end width
>>>>>> [1] 2 1 0
>>>>>>
>>>>>> distance(x, x)
>>>>>> [1] 1
>>>>>>
>>>>>> y
>>>>>> IRanges of length 1
>>>>>> start end width
>>>>>> [1] 3 2 0
>>>>>>
>>>>>> distance(x, y)
>>>>>> [1] 2
>>>>>>
>>>>>> z
>>>>>> IRanges of length 1
>>>>>> start end width
>>>>>> [1] 3 3 1
>>>>>>
>>>>>> distance(x, z)
>>>>>> [1] 2
>>>>>>
>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote:
>>>>>>
>>>>>> A few follow up items from an off-list conversation with Harris.
>>>>>>
>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about
>>> that.
>>>>>>
>>>>>> 2) The primary landing page for ?precede is in IRanges. I've
>>> expanded
>>> the \seealso section of this page to point the user to the man page in
>>> GenomicRanges.
>>>>>>
>>>>>> 3) The precede/follow man page in GenomicRanges now specifically
>>> states that 5' to 3' is the relevant orientation for precede/follow. For
>>> those interested in taking a look, the man page has several aliases,
>>>>>>
>>>>>> ?'nearest-methods'
>>>>>> ?'precede,GenomicRanges,GenomicRanges-method'
>>>>>> ?'follow,GenomicRanges,GenomicRanges-method'
>>>>>>
>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4.
>>>>>>
>>>>>> Thanks Harris.
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
>>>>>> Hi Harris,
>>>>>>
>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>>>>>> Apologies in advance if I just don't understand the ignore.strand
>>> switch, or
>>>>>> perhaps GRanges objects. Also, I have not even tried to understand
>>>>>>
>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow
>>>>>>
>>>>>> I have the impression that a query with strand entirely "*"
>>> essentially implies
>>>>>> ignore.strand, but this case here is handled correctly only if
>>> ignore.strand is
>>>>>> TRUE (consistent with distanceToNearest but not distance, which
>>> seems
>>> correct):
>>>>>>
>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1",
>>> strand="*")
>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)),
>>> seqnames="chr1",
>>> strand=c("+","-"))
>>>>>> distance(x, Y[1])
>>>>>> [1] 1
>>>>>> distance(x, Y[2])
>>>>>> [1] 2
>>>>>>
>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct
>>>>>> [1] 1
>>>>>> nearest(ranges(x), ranges(Y)) # also correct
>>>>>> [1] 1
>>>>>>
>>>>>> However,
>>>>>>
>>>>>> nearest(x, Y)
>>>>>> [1] 2
>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and
>>> 1.10.2.
>>>>>>
>>>>>> The problem was in .nearest() where I computed the preceding and
>>> following distance of the ranges. I was missing abs() so the distance
>>> for
>>> the second range was -2 instead of 2. When checking to see which range
>>> was
>>> closest
>>>>>>
>>>>>> 1< -2
>>>>>> [1] FALSE
>>>>>>
>>>>>> so the second range won.
>>>>>> distanceToNearest(x, Y)
>>>>>> DataFrame with 1 row and 3 columns
>>>>>> queryHits subjectHits distance
>>>>>> <integer> <integer> <integer>
>>>>>> 1 1 2 2
>>>>>>
>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE)
>>>>>> DataFrame with 1 row and 3 columns
>>>>>> queryHits subjectHits distance
>>>>>> <integer> <integer> <integer>
>>>>>> 1 1 1 1
>>>>>>
>>>>>>
>>>>>> Finally,
>>>>>>
>>>>>> follow(ranges(x), ranges(Y))
>>>>>> [1] NA
>>>>>>
>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to
>>> this:
>>>>>>
>>>>>> follow(x, Y) # how can this be?
>>>>>> [1] 2
>>>>>> This behavior is due to the fact that a '*' range can compare itself
>>> to either '+' or '-'. The 'x' is '*" located at position 5. When
>>> comparing it to the '+' Y[1]
>>>>>> we think of both ranges as '+'. precede() and follow() locations are
>>> determined by moving from 3' to 5'. On '+' strand this is from left to
>>> right so 5 precedes 6.
>>>>>>
>>>>>> precede(x, Y[1])
>>>>>> [1] 1
>>>>>>
>>>>>> but does not follow it
>>>>>>
>>>>>> follow(x, Y[1])
>>>>>> [1] NA
>>>>>>
>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On
>>> the
>>> '-' moving from 3' to 5' is right to left so the 5 follows 7
>>>>>>
>>>>>> follow(x, Y[2])
>>>>>> [1] 1
>>>>>>
>>>>>> but does not precede it
>>>>>>
>>>>>> precede(x, Y[2])
>>>>>> [1] NA
>>>>>>
>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'.
>>>>>>
>>>>>> Valerie
>>>>>> whereas
>>>>>>
>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think
>>>>>> [1] NA
>>>>>>
>>>>>> sessionInfo()
>>>>>> R Under development (unstable) (2012-10-10 r60908)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices datasets utils methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>>
>> --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard
>> Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>> [[alternative HTML version deleted]]
>>
>
More information about the Bioconductor
mailing list