[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Patrick Aboyoun paboyoun at fhcrc.org
Fri Sep 11 20:18:38 CEST 2009


Steve,
Good point. I added a clarifying statement in coverage's sift argument 
description:

  \item{shift}{
    For most methods, an integer vector (recycled to the length of
    \code{x}) specifying how each element in \code{x} should be
    (horizontally) shifted before the coverage is computed. Only shifted
    indices in the range [1, \code{width}] will be included in the coverage
    calculation.
    ...
  }


Patrick


Steve Lianoglou wrote:
> Hi Patrick,
>
> Your suggestions were what I was looking for, they're great, thanks!
>
> On Sep 11, 2009, at 12:51 PM, Patrick Aboyoun wrote:
>
>> Steve,
>> I have revised your code to simplify it a bit. I added example data 
>> as well to make it easier for others to run it.
>
> Yeah, I should have pasted the objects in an 
> easy-to-paste-into-R-format, sorry for the oversight.
>
>> The shift and width arguments to coverage are discussed in the 
>> coverage man page.
>
> I was wrestling with them trying to figure out the correct way I was 
> expected to use them ... it just wasn't all that intuitive, to be honest.
>
> I should have noticed in the examples section that the coverage isn't 
> calculated for negative ranges (which is what you get when you shift 
> all of your ranges to start at start(bounds)) ... though it would be 
> clear if I paid more attention to the very first example of 
> ``coverage`` usage.
>
> Perhaps you might consider making an explicit comment in the 
> documentation for the width parameter regarding this fact?
>
> Thanks again,
> -steve
>
>> If you are unsure what the sift argument will do to a set of ranges, 
>> you can run the shift operation on them (e.g. shift(bounds, shift = - 
>> start(bounds) + 1)).
>
>>
>> ==== Revised code ====
>>
>> suppressMessages(library(IRanges))
>>
>> sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554, 12555, 
>> 12557), width = 32)
>> bounds <- IRanges(start=11700, end=12550)
>>
>> # Get the the reads from my sample.reads
>> # object that lie within my gene bounds
>> gene.reads <- sample.reads[bounds]
>>
>> # Build my coverage Rle vector
>> cov <- coverage(gene.reads, shift = - start(bounds) + 1, width = 
>> width(bounds))
>>
>> # Create the non-Rle coverage vector that I will use down stream
>> # to make a GenomeGraph::BaseTrack object
>> counts <- as.numeric(cov)
>>
>> # Build the base-position vector and filter out positions
>> # with 0 counts from both vectors
>> keep <- counts != 0
>>
>> # This is (i) from above
>> bases <- as.integer(bounds)[keep]
>>
>> # This is (ii) from above
>> counts <- counts[keep]
>>
>> =====================
>>
>>
>> > sessionInfo()
>> R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> other attached packages:
>> [1] IRanges_1.3.72
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.10.0
>>
>>
>>
>>
>> Steve Lianoglou wrote:
>>> Hi Michael,
>>>
>>> Thanks for your suggestions so far ... being able to slice a Range
>>> object w/ another Range object somehow fell off of my radar.
>>>
>>> -steve
>>>
>>> On Thu, Sep 10, 2009 at 6:25 PM, Michael 
>>> Lawrence<mflawren at fhcrc.org> wrote:
>>>
>>>> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm finding it hard to shed some of my 10 thumbs when using the
>>>>> IRanges/Rle classes, so can someone suggest the "correct" way to 
>>>>> do what my
>>>>> code does below?
>>>>>
>>>>> My goal is to use GenomeGraphs to plot a picture of my gene model, 
>>>>> along
>>>>> with the reads from 1 or more samples above it. I've actually got 
>>>>> that
>>>>> working, but I feel there's a more idiomatic way to do this, so if 
>>>>> you have
>>>>> any comments, I'd be interested in hearing them.
>>>>>
>>>>> I'll paste the function in its entirety at the end of the email. 
>>>>> It works
>>>>> as advertised, but perhaps not done in the most efficient way. For 
>>>>> brevity's
>>>>> sake, however, I've distilled the specific code for your feedback.
>>>>>
>>>>> Anyway, here's the gist:
>>>>>
>>>>> 1. sample.reads : An IRanges object of the reads aligned to a 
>>>>> particular
>>>>> chromosome from a particular rna-seq sample:
>>>>>
>>>>> R> > head(sample.reads)
>>>>> IRanges instance:
>>>>>   start   end width
>>>>> [1] 11869 11900    32
>>>>> [2] 11882 11913    32
>>>>> [3] 12245 12276    32
>>>>> [4] 12554 12585    32
>>>>> [5] 12555 12586    32
>>>>> [6] 12557 12588    32
>>>>>
>>>>> 2. bounds: An IRange object that indicates the start and end 
>>>>> position of
>>>>> my gene -- I want all the reads that land here:
>>>>>
>>>>> R> bounds
>>>>> IRanges instance:
>>>>>      start      end width
>>>>> [1] 12040238 12073571 33334
>>>>>
>>>>> Goal:
>>>>> I wan to build two vectors:
>>>>> (i) A vector of positions on my chromosome (having length == to 
>>>>> the length
>>>>> of my gene)
>>>>> (ii) A vector of counts over those positions
>>>>>
>>>>> Positions with 0 counts are removed from (i) and (ii)
>>>>>
>>>>> ==== Here's my code ====
>>>>>
>>>>> # Get the index of the reads from my sample.reads
>>>>> # object that lie within my gene bounds
>>>>> which.reads <- subjectHits(overlap(sample.reads, bounds))
>>>>>
>>>>> # Slice out these reads
>>>>> gene.reads <- reads[which.reads]
>>>>>
>>>>>
>>>> The above two steps could be simply:
>>>>
>>>>> gene.reads <- sample.reads[bounds]
>>>>>      # Build my coverage Rle vector
>>>>> cov <- coverage(gene.reads)
>>>>>
>>>>> # Only keep the part of the Rle that starts @ the start
>>>>> # of my gene to the end (I feel like doing this is weird (?))
>>>>> cov <- cov[start(bounds):length(cov)]
>>>>>
>>>>> # Create the non-Rle coverage vector that I will use down stream
>>>>> # to make a GenomeGraph::BaseTrack object
>>>>> counts <- numeric(width(bounds))
>>>>>
>>>>> # Populate this vector with the correct coverage counts
>>>>> counts[1:length(cov)] <- as.vector(cov)
>>>>>
>>>>>
>>>> As you say, this looks like it could be done easier using arguments to
>>>> coverage(). Maybe someone else could explain this?
>>>>
>>>>
>>>>> # Build the base-position vector and filter out positions
>>>>> # with 0 counts from both vectors
>>>>> keep <- counts != 0
>>>>>
>>>>> # This is (i) from above
>>>>> bases <- (start(bounds):end(bounds))[keep]
>>>>>
>>>>>
>>>> A bit shorter:
>>>> bases <- as.integer(bounds)[keep]
>>>>
>>>>
>>>>> # This is (ii) from above
>>>>> counts <- counts[keep]
>>>>>
>>>>> =====================
>>>>>
>>>>> I feel like the coverage interface, using the start/end args would 
>>>>> help
>>>>> here, no? Wouldn't this have worked to make things a bit easier?
>>>>>
>>>>> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>>>>>
>>>>> How would you do that by using the width/shift args? Or wouldn't you?
>>>>> Maybe I'm not clear on what the original start/end args in 
>>>>> ``coverage`` are
>>>>> meant to do?
>>>>>
>>>>> Anyway, I was just curious if there was a better way to do what I 
>>>>> need to
>>>>> do.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -steve
>>>>>
>>>>>
>>>>> ===== Here's the function in all of its glory, use it if you like 
>>>>> =====
>>>>>
>>>>> Here's a sample call:
>>>>>
>>>>> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>>>>>   gene.id='ENSG00000116688', title="MFN2")
>>>>>
>>>>> Here's the function:
>>>>>
>>>>> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,
>>>>> gene.id=NULL,
>>>>>                                biomart=NULL, cols=NULL, title=NULL) {
>>>>> # Plots the read counts over a particular gene model.
>>>>> #
>>>>> # You can pass your own gene.model as an IRange object, or use 
>>>>> biomaRt to
>>>>> # retrieve the model using the Ensembl Gene ID gene.id
>>>>> #
>>>>> # Parameters
>>>>> # ----------
>>>>> # sample.reads : named list[IRanges] representing reads from separate
>>>>> samples
>>>>> #                over a given chromosome (the one your gene is on!)
>>>>> # gene.model   : IRanges object of start/end positions of the gene
>>>>> # mart         : biomaRt, if you want to query ensembl for the 
>>>>> gene model
>>>>> # cols         : the colors you want to use for the reads in the
>>>>> respective
>>>>> #                samples
>>>>> if (is.null(gene.model) && is.null(gene.id)) {
>>>>>   stop("Need to get the gene.model from somewhere.")
>>>>> }
>>>>>
>>>>> # Figure out what type of gene.model we're using and get appropriate
>>>>> bounds
>>>>> if (!is.null(gene.model)) {
>>>>>   gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
>>>>>   bounds <- range(gene.model)
>>>>> } else {
>>>>>   if (is.null(biomart)) {
>>>>>     biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>>>>>   }
>>>>>   gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart)
>>>>>   if (is.null(title)) {
>>>>>     title <- gene.id
>>>>>   }
>>>>>   anno <- gm at ens
>>>>>   bounds <- range(IRanges(start=anno$exon_chrom_start,
>>>>>                           end=anno$exon_chrom_end))
>>>>> }
>>>>>
>>>>> sample.names <- names(sample.reads)
>>>>>
>>>>> if (is.null(cols)) {
>>>>>   cols <- rainbow(length(sample.reads))
>>>>>   names(cols) <- sample.names
>>>>> }
>>>>>
>>>>> if (is.null(title)) {
>>>>>   title <- 'Gene Reads'
>>>>> }
>>>>>
>>>>> stopifnot(length(cols) == length(sample.reads))
>>>>> stopifnot(names(cols) == sample.names)
>>>>>
>>>>> # Calculate the coverage vectors over the gene model for each sample
>>>>> sample.coverage <- lapply(sample.names, function(name) {
>>>>>   reads <- sample.reads[[name]]
>>>>>   which.reads <- subjectHits(overlap(reads, bounds))
>>>>>   gene.reads <- reads[which.reads]
>>>>>   cov <- coverage(gene.reads)
>>>>>   cov <- cov[start(bounds):length(cov)]
>>>>>
>>>>>   counts <- numeric(width(bounds))
>>>>>   counts[1:length(cov)] <- as.vector(cov)
>>>>>   keep <- counts != 0
>>>>>   bases <- (start(bounds):end(bounds))[keep]
>>>>>   counts <- counts[keep]
>>>>>   list(pos=bases, coverage=counts)
>>>>> })
>>>>> names(sample.coverage) <- sample.names
>>>>>
>>>>> # Make y-axis the same for each sample
>>>>> max.counts <- max(sapply(sample.coverage, function(cov)
>>>>> max(cov$coverage)))
>>>>>
>>>>> tracks <- lapply(sample.names, function(name) {
>>>>>   dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0, 
>>>>> max.counts))
>>>>>   makeBaseTrack(base=sample.coverage[[name]]$pos,
>>>>>                 value=sample.coverage[[name]]$coverage,
>>>>>                 dp=dp)
>>>>> })
>>>>> names(tracks) <- names(sample.reads)
>>>>>
>>>>> title <- makeTitle(text=title)
>>>>> plot.me <- c(title, tracks, list(gm))
>>>>>
>>>>> gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
>>>>> }
>>>>>
>>>>>
>>>>> -- 
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>> |  Memorial Sloan-Kettering Cancer Center
>>>>> |  Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
> -- 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>   |  Memorial Sloan-Kettering Cancer Center
>   |  Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



More information about the Bioc-sig-sequencing mailing list