[BioC] fastest way to keep score when reduce Granges data

Ou, Jianhong Jianhong.Ou at umassmed.edu
Wed Feb 26 22:17:39 CET 2014


Cool code. Yes, it is resampling (sorry I don't know this name). Your idea
is do coverage first and then use view-summarization methods to resample
the data. In the sample code we use fixed-size tiling windows. After
resampling, the signal will be averaged by the window size. If I want to
remove the affection of 0 points, I could just use viewApply and write a
function to filter the 0 points. I got it.

Thank you a lot.


Yours Sincerely,

Jianhong Ou

LRB 670A
Program in Gene Function and Expression
364 Plantation Street Worcester,
MA 01605




On 2/26/14 2:58 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

>Hi Jianhong,
>
>Thanks for the new code. Still didn't work out of the box for me but I
>managed to run it after some minor edit. Please understand that before
>we can show you the "high efficient code", we need to understand exactly
>what you're trying to do. And since you didn't give much details, the
>"slow code" is the only thing we have.
>
>So what you're trying to do looks like a common use case to me: you
>have a numeric variable defined a along the genome (the score in your
>case), and you want to summarize it on tiling windows (often fixed-size
>windows but they don't need to be). It's a classic problem in digital
>signal processing, called resampling. Note that you cannot just "keep"
>the score, you need to summarize in a way or another. In your case,
>you choose to sum the scores associated with the ranges that overlap
>with a given tiling window.
>
>This topic is covered in the examples section of the tileGenome()
>function in the GenomicRanges package. Note that the approach described
>there uses the "weighted coverage" of the variable to solve the problem,
>no findOverlaps() or reduce() is needed:
>
>   score <- coverage(.dat, weight="score")
>
>This puts the score in an RleList object so now there is a score
>associated to each genomic position. Note that when the input
>object has overlapping ranges (which is the case for your input
>object '.dat'), the score associated to a given genomic position is
>the sum of the scores associated to the original ranges that cover
>that position.
>
>Then to summarize 'score' on your fixed-size tiling windows, you need
>a summarizing function like the binnedAverage() function shown in
>?tileGenome. binnedAverage() computes the average on each window but
>it's easy to write a summarizing function that computes the sum:
>
>      binnedSum <- function(bins, numvar, mcolname)
>      {
>          stopifnot(is(bins, "GRanges"))
>          stopifnot(is(numvar, "RleList"))
>          stopifnot(identical(seqlevels(bins), names(numvar)))
>          bins_per_chrom <- split(ranges(bins), seqnames(bins))
>          sums_list <- lapply(names(numvar),
>              function(seqname) {
>                  views <- Views(numvar[[seqname]],
>                                 bins_per_chrom[[seqname]])
>                  viewSums(views)
>              })
>          new_mcol <- unsplit(sums_list, as.factor(seqnames(bins)))
>          mcols(bins)[[mcolname]] <- new_mcol
>          bins
>      }
>
>Then:
>
>      GRwin2 <- binnedSum(GRwin, score, "binned_score")
>
>This will not give you the same result as what you get with your
>"slow code" below because of how you deal with original ranges
>spanning more than one tiling window (you arbitrary assign the range
>to one tiling window), and also because the contribution of the
>original score values to our final "binned score" here is weighted
>by the width of the original range (but since all the ranges in your
>'.dat' object have width=2, you can easily adjust this).
>
>Hope this helps,
>H.
>
>
>On 02/26/2014 08:40 AM, Ou, Jianhong wrote:
>> Hi Herve,
>>
>> I am not mean I have the code. As I asked, I want the high efficient
>>one.
>> What I am doing now is that something like this (much faster than last
>> codes)
>>
>> size <- 50000
>> FUN <- sum
>>
>> .dat <- GRanges("chr1", IRanges(start=1:size, width=2), strand="+",
>> score=sample(1:size, size))
>> windowSize <- 10
>> GRwin <- GRanges("chr1",
>> IRanges(start=(0:(size/windowSize))*windowSize+scale[1]-1,
>>                                              width=windowSize),
>>strand="+")
>> ##new codes, rename seqnames and reduce
>> system.time({
>> ##split by strand NO NEED
>> ol <- as.data.frame(findOverlaps(.dat, GRwin))
>> ol <- ol[!duplicated(ol[,1]), 2]
>> ##change the seqnames by windows.
>> new.seqname <- paste(seqnames(.dat), "__gp", ol, sep="")
>> .newData <- GRanges(new.seqname, IRanges(start(.dat), end(.dat),
>> names=names(.dat)), strand=strand(.dat))
>> mcols(.newData) <- mcols(.dat)
>> .datR <- reduce(.newData, with.mapping=TRUE)
>> .datR$score <- sapply(.datR$mapping, function(.idx)
>>FUN(.dat$score[.idx]))
>> .datReduceWithScore <- GRanges(gsub("__gp\\d+$", "", seqnames(.datR)),
>> IRanges(start(.datR), end(.datR)), strand=strand(.datR),
>> score=score(.datR))
>> })
>>
>> ##user  system elapsed
>> ##1.469   0.022   1.492
>> ##old codes, split and reduce
>> system.time({
>> ol <- findOverlaps(.dat, GRwin)
>> ol <- as.data.frame(ol)
>> ol <- ol[!duplicated(ol[,1]),]
>> .datS <- split(.dat, ol[,2])
>> reduceValue <- function(.datReduce){
>>                    .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
>>                    wid <- width(.datReduce)
>>                    .datReduceScore <- .datReduce$score
>>                    .datReduceM$score <- sapply(.datReduceM$mapping,
>> function(.idx){
>> FUN(.datReduceScore[.idx])
>>                    })
>>                    .datReduceM$mapping <- NULL
>>                    .datReduceM
>>                }
>> .datReduceWithScore2 <- lapply(.datS, reduceValue)
>> .datReduceWithScore2 <- unlist(GRangesList(.datReduceWithScore2))
>> })
>>
>> ##   user  system elapsed
>> ##300.591   3.833 310.751
>> .datReduceWithScore <-
>> .datReduceWithScore[order(as.character(seqnames(.datReduceWithScore)),
>> start(.datReduceWithScore))]
>> names(.datReduceWithScore2) <- NULL
>> identical(.datReduceWithScore, .datReduceWithScore2)
>> ##TRUE
>>
>> Yours Sincerely,
>>
>> Jianhong Ou
>>
>> LRB 670A
>> Program in Gene Function and Expression
>> 364 Plantation Street Worcester,
>> MA 01605
>>
>>
>>
>>
>> On 2/25/14 8:21 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>>
>>> Hi Jianhong,
>>>
>>> It would help enormously if you could send code that we can actually
>>> run. Thanks!
>>>
>>> H.
>>>
>>>
>>> On 02/24/2014 07:53 AM, Ou, Jianhong wrote:
>>>> Hi ALL,
>>>>
>>>> I want to reduce a GRanges data by fixed window size and keep scores
>>>> after reduce. My code is
>>>>
>>>> .dat <- GRanges("chr1", Iranges(start=1:50, width=2), strand="+",
>>>> score=Sample(1:50, 50))
>>>> windowSize <- 10
>>>> Grwin <- GRanges("chr1", IRanges(start=(0:5)*windowSize+scale[1]-1,
>>>>                                             width=windowSize),
>>>> strand="+")
>>>> ol <- findOverlaps(.dat, GRwin)
>>>> ol <- as.data.frame(ol)
>>>> ol <- ol[!duplicated(ol[,1]),]
>>>> .dat <- split(.dat, ol[,2])
>>>> reduceValue <- function(.datReduce){
>>>>                   .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
>>>>                   wid <- width(.datReduce)
>>>>                   .datReduceScore <- .datReduce$value
>>>>                   .datReduceM$score <- sapply(.datReduceM$mapping,
>>>> function(.idx){
>>>>
>>>> round(sum(.datReduceScore[.idx]*wid[.idx])/sum(wid[.idx]))
>>>>                   })
>>>>                   .datReduceM$mapping <- NULL
>>>>                   .datReduceM
>>>>               }
>>>> .dat <- lapply(.dat, reduceValue)
>>>> .dat <- unlist(GRangesList(.dat))
>>>>
>>>> But the efficiency is very low. What is the best way to keep scores
>>>> when reduce GRanges data by fixed window size? Thanks for your help.
>>>>
>>>> Yours sincerely,
>>>>
>>>> Jianhong Ou
>>>>
>>>> LRB 670A
>>>> Program in Gene Function and Expression
>>>> 364 Plantation Street Worcester,
>>>> MA 01605
>>>>
>>>> 	[[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>
>
>-- 
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone:  (206) 667-5791
>Fax:    (206) 667-1319



More information about the Bioconductor mailing list