[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Fri Sep 11 04:05:10 CEST 2009

Hi Michael,

Thanks for your suggestions so far ... being able to slice a Range
object w/ another Range object somehow fell off of my radar.

-steve

On Thu, Sep 10, 2009 at 6:25 PM, Michael Lawrence<mflawren at fhcrc.org> wrote:
>
>
> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>>
>> Hi all,
>>
>> I'm finding it hard to shed some of my 10 thumbs when using the
>> IRanges/Rle classes, so can someone suggest the "correct" way to do what my
>> code does below?
>>
>> My goal is to use GenomeGraphs to plot a picture of my gene model, along
>> with the reads from 1 or more samples above it. I've actually got that
>> working, but I feel there's a more idiomatic way to do this, so if you have
>> any comments, I'd be interested in hearing them.
>>
>> I'll paste the function in its entirety at the end of the email. It works
>> as advertised, but perhaps not done in the most efficient way. For brevity's
>> sake, however, I've distilled the specific code for your feedback.
>>
>> Anyway, here's the gist:
>>
>> 1. sample.reads : An IRanges object of the reads aligned to a particular
>> chromosome from a particular rna-seq sample:
>>
>> R> > head(sample.reads)
>> IRanges instance:
>>    start   end width
>> [1] 11869 11900    32
>> [2] 11882 11913    32
>> [3] 12245 12276    32
>> [4] 12554 12585    32
>> [5] 12555 12586    32
>> [6] 12557 12588    32
>>
>> 2. bounds: An IRange object that indicates the start and end position of
>> my gene -- I want all the reads that land here:
>>
>> R> bounds
>> IRanges instance:
>>       start      end width
>> [1] 12040238 12073571 33334
>>
>> Goal:
>> I wan to build two vectors:
>> (i) A vector of positions on my chromosome (having length == to the length
>> of my gene)
>> (ii) A vector of counts over those positions
>>
>> Positions with 0 counts are removed from (i) and (ii)
>>
>> ==== Here's my code ====
>>
>> # Get the index of the reads from my sample.reads
>> # object that lie within my gene bounds
>> which.reads <- subjectHits(overlap(sample.reads, bounds))
>>
>> # Slice out these reads
>> gene.reads <- reads[which.reads]
>>
>
> The above two steps could be simply:
>> gene.reads <- sample.reads[bounds]
>
>>
>> # Build my coverage Rle vector
>> cov <- coverage(gene.reads)
>>
>> # Only keep the part of the Rle that starts @ the start
>> # of my gene to the end (I feel like doing this is weird (?))
>> cov <- cov[start(bounds):length(cov)]
>>
>> # Create the non-Rle coverage vector that I will use down stream
>> # to make a GenomeGraph::BaseTrack object
>> counts <- numeric(width(bounds))
>>
>> # Populate this vector with the correct coverage counts
>> counts[1:length(cov)] <- as.vector(cov)
>>
>
> As you say, this looks like it could be done easier using arguments to
> coverage(). Maybe someone else could explain this?
>
>> # Build the base-position vector and filter out positions
>> # with 0 counts from both vectors
>> keep <- counts != 0
>>
>> # This is (i) from above
>> bases <- (start(bounds):end(bounds))[keep]
>>
>
> A bit shorter:
> bases <- as.integer(bounds)[keep]
>
>>
>> # This is (ii) from above
>> counts <- counts[keep]
>>
>> =====================
>>
>> I feel like the coverage interface, using the start/end args would help
>> here, no? Wouldn't this have worked to make things a bit easier?
>>
>> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>>
>> How would you do that by using the width/shift args? Or wouldn't you?
>> Maybe I'm not clear on what the original start/end args in ``coverage`` are
>> meant to do?
>>
>> Anyway, I was just curious if there was a better way to do what I need to
>> do.
>>
>> Thanks,
>>
>> -steve
>>
>>
>> ===== Here's the function in all of its glory, use it if you like =====
>>
>> Here's a sample call:
>>
>> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>>    gene.id='ENSG00000116688', title="MFN2")
>>
>> Here's the function:
>>
>> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,
>> gene.id=NULL,
>>                                 biomart=NULL, cols=NULL, title=NULL) {
>>  # Plots the read counts over a particular gene model.
>>  #
>>  # You can pass your own gene.model as an IRange object, or use biomaRt to
>>  # retrieve the model using the Ensembl Gene ID gene.id
>>  #
>>  # Parameters
>>  # ----------
>>  # sample.reads : named list[IRanges] representing reads from separate
>> samples
>>  #                over a given chromosome (the one your gene is on!)
>>  # gene.model   : IRanges object of start/end positions of the gene
>>  # mart         : biomaRt, if you want to query ensembl for the gene model
>>  # cols         : the colors you want to use for the reads in the
>> respective
>>  #                samples
>>  if (is.null(gene.model) && is.null(gene.id)) {
>>    stop("Need to get the gene.model from somewhere.")
>>  }
>>
>>  # Figure out what type of gene.model we're using and get appropriate
>> bounds
>>  if (!is.null(gene.model)) {
>>    gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
>>    bounds <- range(gene.model)
>>  } else {
>>    if (is.null(biomart)) {
>>      biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>>    }
>>    gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart)
>>    if (is.null(title)) {
>>      title <- gene.id
>>    }
>>    anno <- gm at ens
>>    bounds <- range(IRanges(start=anno$exon_chrom_start,
>>                            end=anno$exon_chrom_end))
>>  }
>>
>>  sample.names <- names(sample.reads)
>>
>>  if (is.null(cols)) {
>>    cols <- rainbow(length(sample.reads))
>>    names(cols) <- sample.names
>>  }
>>
>>  if (is.null(title)) {
>>    title <- 'Gene Reads'
>>  }
>>
>>  stopifnot(length(cols) == length(sample.reads))
>>  stopifnot(names(cols) == sample.names)
>>
>>  # Calculate the coverage vectors over the gene model for each sample
>>  sample.coverage <- lapply(sample.names, function(name) {
>>    reads <- sample.reads[[name]]
>>    which.reads <- subjectHits(overlap(reads, bounds))
>>    gene.reads <- reads[which.reads]
>>    cov <- coverage(gene.reads)
>>    cov <- cov[start(bounds):length(cov)]
>>
>>    counts <- numeric(width(bounds))
>>    counts[1:length(cov)] <- as.vector(cov)
>>    keep <- counts != 0
>>    bases <- (start(bounds):end(bounds))[keep]
>>    counts <- counts[keep]
>>    list(pos=bases, coverage=counts)
>>  })
>>  names(sample.coverage) <- sample.names
>>
>>  # Make y-axis the same for each sample
>>  max.counts <- max(sapply(sample.coverage, function(cov)
>> max(cov$coverage)))
>>
>>  tracks <- lapply(sample.names, function(name) {
>>    dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0, max.counts))
>>    makeBaseTrack(base=sample.coverage[[name]]$pos,
>>                  value=sample.coverage[[name]]$coverage,
>>                  dp=dp)
>>  })
>>  names(tracks) <- names(sample.reads)
>>
>>  title <- makeTitle(text=title)
>>  plot.me <- c(title, tracks, list(gm))
>>
>>  gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
>> }
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  |  Memorial Sloan-Kettering Cancer Center
>>  |  Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact