[BioC] GenomicRanges::nearest (especially follow)
Valerie Obenchain
vobencha at fhcrc.org
Sat Oct 13 01:17:03 CEST 2012
Hi Harris,
On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
> Apologies in advance if I just don't understand the ignore.strand switch, or
> perhaps GRanges objects. Also, I have not even tried to understand
>
> GenomicRanges:::.GenomicRanges_findPrecedeFollow
>
> I have the impression that a query with strand entirely "*" essentially implies
> ignore.strand, but this case here is handled correctly only if ignore.strand is
> TRUE (consistent with distanceToNearest but not distance, which seems correct):
>
>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*")
>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-"))
>> distance(x, Y[1])
> [1] 1
>> distance(x, Y[2])
> [1] 2
>
>> nearest(x, Y, ignore.strand=TRUE) # correct
> [1] 1
>> nearest(ranges(x), ranges(Y)) # also correct
> [1] 1
>
> However,
>
>> nearest(x, Y)
> [1] 2
Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2.
The problem was in .nearest() where I computed the preceding and
following distance of the ranges. I was missing abs() so the distance
for the second range was -2 instead of 2. When checking to see which
range was closest
> 1 < -2
[1] FALSE
so the second range won.
>> distanceToNearest(x, Y)
> DataFrame with 1 row and 3 columns
> queryHits subjectHits distance
> <integer> <integer> <integer>
> 1 1 2 2
>
>> distanceToNearest(x, Y, ignore.strand=TRUE)
> DataFrame with 1 row and 3 columns
> queryHits subjectHits distance
> <integer> <integer> <integer>
> 1 1 1 1
>
>
> Finally,
>
>> follow(ranges(x), ranges(Y))
> [1] NA
>
> The issue (along with GenomicRanges:::.nearest) must come down to this:
>
>> follow(x, Y) # how can this be?
> [1] 2
This behavior is due to the fact that a '*' range can compare itself to
either '+' or '-'. The 'x' is '*" located at position 5. When
comparing it to the '+' Y[1]
we think of both ranges as '+'. precede() and follow() locations are
determined by moving from 3' to 5'. On '+' strand this is from left to
right so 5 precedes 6.
> precede(x, Y[1])
[1] 1
but does not follow it
> follow(x, Y[1])
[1] NA
When comparing the '*' to Y[2] we think of both ranges as '-'. On the
'-' moving from 3' to 5' is right to left so the 5 follows 7
> follow(x, Y[2])
[1] 1
but does not precede it
> precede(x, Y[2])
[1] NA
When we set ignore.strand=TRUE, all ranges are considered '+'.
Valerie
> whereas
>
>> follow(x, Y, ignore.strand=TRUE) # correct, I think
> [1] NA
>
>> sessionInfo()
> R Under development (unstable) (2012-10-10 r60908)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list