[BioC] GenomicRanges::nearest (especially follow)

Harris A. Jaffee hj at jhu.edu
Wed Oct 17 18:22:36 CEST 2012


On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote:

> Indeed, for distance to behave as a metric it must be that distance(x,x)
> == 0 for all x
> 
> Here is some further (consistent) oddness:
> 
>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3)))
> [1] -5

Thanks for the tip on compare.  Looks like a nice classification scheme.
I should start reading vignettes!

> I would expect the value to be 0 here.  Comparing x with itself should
> yield 0.  After all they are identical!

Yeah, identical but *empty*.  Therefore, none of the scenarios depicted
in the man page seems appropriate (even 0/g):

                -6 a: x[i]: .oooo.......         6 m: x[i]: .......oooo.
                      y[i]: .......oooo.              y[i]: .oooo.......
          
                -5 b: x[i]: ..oooo......         5 l: x[i]: ......oooo..
                      y[i]: ......oooo..              y[i]: ..oooo......
          
                -4 c: x[i]: ...oooo.....         4 k: x[i]: .....oooo...
                      y[i]: .....oooo...              y[i]: ...oooo.....
          
                -3 d: x[i]: ...oooooo...         3 j: x[i]: .....oooo...
                      y[i]: .....oooo...              y[i]: ...oooooo...
          
                -2 e: x[i]: ..oooooooo..         2 i: x[i]: ....oooo....
                      y[i]: ....oooo....              y[i]: ..oooooooo..
          
                -1 f: x[i]: ...oooo.....         1 h: x[i]: ...oooooo...
                      y[i]: ...oooooo...              y[i]: ...oooo.....
          
                                0 g: x[i]: ...oooooo...
                                     y[i]: ...oooooo...


Again, I like NA in these cases.


> --Malcolm
> 
> 
> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at gmail.com> wrote:
> 
>> Hmm, that's odd.  And it seems inconsistent with the usual distance() and
>> distanceToNearest() methods, which would return 0 in this case, yes?
>> 
>> e.g. if the IRanges were GRanges and sitting on top of each other on the
>> same strand, or if the first was distance()'ed against itself (the same
>> thing)
>> 
>> That doesn't make sense to me either.  This doesn't seem like an issue of
>> theology to me, but rather inconsistency with what the docs say will
>> happen.
>> 
>> 
>> 
>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at jhu.edu> wrote:
>> 
>>> As I said, any rule is fine, especially if they can serve a purpose
>>> via their "location".  But you also have this, which seems weird:
>>> 
>>>> distance(IRanges(2,1), IRanges(2,1))
>>>        [1] 1
>>> 
>>> To say they are not treated any differently is a big stretch to me,
>>> since their defining formula (end = start-1) is already a violation
>>> of my senses.  [I wonder if SEW = (start, NA, 0) wouldn't have been
>>> more useful.]
>>> 
>>> Just so we don't continue off into useless theology, I should say that
>>> I started looking into the distance code with the idea of adding a
>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and
>>> I wanted to understand the landscape first.
>>> 
>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote:
>>> 
>>>> Hi Harris, Val, Michael,
>>>> 
>>>> Right now, all those distances are equal to 8, which I think is what
>>>> we want:
>>>> 
>>>> distance(IRanges(1,2), IRanges(10,16))
>>>> distance(IRanges(10,16), IRanges(1,2))
>>>> distance(IRanges(1,2), IRanges(10,9))
>>>> distance(IRanges(10,9),IRanges(1,2))
>>>> distance(IRanges(3,2), IRanges(10,16))
>>>> distance(IRanges(10,16), IRanges(3,2))
>>>> distance(IRanges(3,2), IRanges(10,9))
>>>> distance(IRanges(10,9), IRanges(3,2))
>>>> 
>>>> Yes, from a mathematical point of view, the distance between a
>>>> non-empty set and an empty set if not defined, but I'm not sure there
>>>> would be much to gain in doing this. Some use cases would be needed.
>>>> What's nice about the current behavior is that zero-width ranges are
>>>> not treated any differently than other ranges.
>>>> 
>>>> H.
>>>> 
>>>> 
>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote:
>>>>> Any rule is fine with me.  They would not occur for us except by
>>> mistake.
>>>>> I was basically raising an alert that the current situation is not
>>> quite
>>>>> complete.
>>>>> 
>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote:
>>>>> 
>>>>>> Since these zero-width ranges do have a position, I'm not sure why
>>> we
>>> cannot calculate their distance. As long as we have an established rule
>>> about how we handle them. For example, is it just from the start? That's
>>> probably most intuitive.
>>>>>> 
>>>>>> Michael
>>>>>> 
>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <
>>> vobencha at fhcrc.org> wrote:
>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote:
>>>>>> Fantastic, thank you.  I'm not sure I knew about that page!
>>>>>> 
>>>>>> For example, ?precede gives me the IRanges nearest-methods page,
>>>>>> and only if I explicitly request ?'nearest-methods' can I choose
>>>>>> the GenomicRanges page.  That brings up another issue: ?distance
>>>>>> again sends me to the IRanges nearest-methods help, and that only
>>>>>> documents distanceToNearest, which is not enough.  The spec for
>>>>>> distance actually lives only on
>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see.
>>>>>> 
>>>>>> I can update the IRanges man page to include distance().
>>>>>> 
>>>>>> Finally, and admittedly a total edge case, but zero-width ranges
>>>>>> appear to be mishandled, or not handled, by the IRanges distance
>>>>>> code.  I think they should be infinitely far from anything, or at
>>>>>> least NA, and there should not be any nearest or distanceToNearest
>>>>>> value for a zero-width query.
>>>>>> cc'ing Herve and Michael for history.
>>>>>> 
>>>>>> In general I agree that zero-width ranges should not have a distance
>>> from any other range. I favor returning NA instead of inf since the
>>> range
>>> (albeit zero-width) does have a location.
>>>>>> 
>>>>>> One concern I have is that we use zero-width ranges to represent
>>> insertions. In that context we may want to compute the distance to the
>>> nearest insertion. I do not have a concrete use case - just thinking of
>>> what might come. I'd be interested in opinions on this.
>>>>>> 
>>>>>> Valerie
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> x
>>>>>> IRanges of length 1
>>>>>>     start end width
>>>>>> [1]     2   1     0
>>>>>> 
>>>>>> distance(x, x)
>>>>>> [1] 1
>>>>>> 
>>>>>> y
>>>>>> IRanges of length 1
>>>>>>     start end width
>>>>>> [1]     3   2     0
>>>>>> 
>>>>>> distance(x, y)
>>>>>> [1] 2
>>>>>> 
>>>>>> z
>>>>>> IRanges of length 1
>>>>>>     start end width
>>>>>> [1]     3   3     1
>>>>>> 
>>>>>> distance(x, z)
>>>>>> [1] 2
>>>>>> 
>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote:
>>>>>> 
>>>>>> A few follow up items from an off-list conversation with Harris.
>>>>>> 
>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about
>>> that.
>>>>>> 
>>>>>> 2) The primary landing page for ?precede is in IRanges. I've
>>> expanded
>>> the \seealso section of this page to point the user to the man page in
>>> GenomicRanges.
>>>>>> 
>>>>>> 3) The precede/follow man page in GenomicRanges now  specifically
>>> states that 5' to 3' is the relevant orientation for precede/follow. For
>>> those interested in taking a look, the man page has several aliases,
>>>>>> 
>>>>>> ?'nearest-methods'
>>>>>> ?'precede,GenomicRanges,GenomicRanges-method'
>>>>>> ?'follow,GenomicRanges,GenomicRanges-method'
>>>>>> 
>>>>>> Changes can be found in IRanges 1.17.3  and GenomicRanges 1.11.4.
>>>>>> 
>>>>>> Thanks Harris.
>>>>>> 
>>>>>> Valerie
>>>>>> 
>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
>>>>>> Hi Harris,
>>>>>> 
>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>>>>>> Apologies in advance if I just don't understand the ignore.strand
>>> switch, or
>>>>>> perhaps GRanges objects.  Also, I have not even tried to understand
>>>>>> 
>>>>>>    GenomicRanges:::.GenomicRanges_findPrecedeFollow
>>>>>> 
>>>>>> I have the impression that a query with strand entirely "*"
>>> essentially implies
>>>>>> ignore.strand, but this case here is handled correctly only if
>>> ignore.strand is
>>>>>> TRUE (consistent with distanceToNearest but not distance, which
>>> seems
>>> correct):
>>>>>> 
>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1",
>>> strand="*")
>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)),
>>> seqnames="chr1",
>>> strand=c("+","-"))
>>>>>> distance(x, Y[1])
>>>>>> [1] 1
>>>>>> distance(x, Y[2])
>>>>>> [1] 2
>>>>>> 
>>>>>> nearest(x, Y, ignore.strand=TRUE)    # correct
>>>>>> [1] 1
>>>>>> nearest(ranges(x), ranges(Y))           # also correct
>>>>>> [1] 1
>>>>>> 
>>>>>> However,
>>>>>> 
>>>>>> nearest(x, Y)
>>>>>> [1] 2
>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and
>>> 1.10.2.
>>>>>> 
>>>>>> The problem was in .nearest() where I computed the preceding and
>>> following distance of the ranges. I was missing abs() so the distance
>>> for
>>> the second range was -2 instead of 2. When checking to see which range
>>> was
>>> closest
>>>>>> 
>>>>>> 1<  -2
>>>>>> [1] FALSE
>>>>>> 
>>>>>> so the second range won.
>>>>>> distanceToNearest(x, Y)
>>>>>> DataFrame with 1 row and 3 columns
>>>>>>  queryHits subjectHits  distance
>>>>>> <integer>  <integer>  <integer>
>>>>>> 1         1           2         2
>>>>>> 
>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE)
>>>>>> DataFrame with 1 row and 3 columns
>>>>>>   queryHits subjectHits  distance
>>>>>> <integer>  <integer>  <integer>
>>>>>> 1         1           1         1
>>>>>> 
>>>>>> 
>>>>>> Finally,
>>>>>> 
>>>>>> follow(ranges(x), ranges(Y))
>>>>>> [1] NA
>>>>>> 
>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to
>>> this:
>>>>>> 
>>>>>> follow(x, Y)                # how can this be?
>>>>>> [1] 2
>>>>>> This behavior is due to the fact that a '*' range can compare itself
>>> to either '+' or '-'.  The 'x' is '*" located at position 5.  When
>>> comparing it to the '+'  Y[1]
>>>>>> we think of both ranges as '+'. precede() and follow() locations are
>>> determined by moving from 3' to 5'. On '+' strand this is from left to
>>> right so 5 precedes 6.
>>>>>> 
>>>>>> precede(x, Y[1])
>>>>>> [1] 1
>>>>>> 
>>>>>> but does not follow it
>>>>>> 
>>>>>> follow(x, Y[1])
>>>>>> [1] NA
>>>>>> 
>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On
>>> the
>>> '-' moving from  3' to 5' is right to left so the  5 follows 7
>>>>>> 
>>>>>> follow(x, Y[2])
>>>>>> [1] 1
>>>>>> 
>>>>>> but does not precede it
>>>>>> 
>>>>>> precede(x, Y[2])
>>>>>> [1] NA
>>>>>> 
>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'.
>>>>>> 
>>>>>> Valerie
>>>>>> whereas
>>>>>> 
>>>>>> follow(x, Y, ignore.strand=TRUE)    # correct, I think
>>>>>> [1] NA
>>>>>> 
>>>>>> sessionInfo()
>>>>>> R Under development (unstable) (2012-10-10 r60908)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>> 
>>>>>> locale:
>>>>>>  [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>>>>>  [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>>>>>  [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>>>>>>  [7] LC_PAPER=C                     LC_NAME=C
>>>>>>  [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>>> 
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>>>>> 
>>>>>> other attached packages:
>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0       BiocGenerics_0.5.0
>>>>>> 
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] parallel_2.16.0 stats4_2.16.0   tools_2.16.0
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Hervé Pagès
>>>> 
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>> 
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>> 
>> 
>> 
>> --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard 
>> Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>> 
>>       [[alternative HTML version deleted]]
>> 
> 



More information about the Bioconductor mailing list