[Bioc-sig-seq] AlignedRead and complex subsetting

Wed Sep 2 22:44:08 CEST 2009

Michael Lawrence wrote:
> Martin,
> This %in% method looks great. Will this become part of ShortRead?

yes it and as(aln, "RangesList") are in version 1.3.31, probably
available with biocLite and the devel version of R after noon, Seattle
time, tomorrow.

Both can be tricky because unaligned reads ('NA' in chromosome,
position, or width) are dropped (in as(aln, "RangesList")) or always !
%in% a RangesList (though this is consistent with how %in% works with NA
values). They also assume that reads are presented in 'leftmost' (see
?"coverage,AlignedRead-method") orientation, which is true if the reads
are coming from readAligned but might not be true if the reads have been
created 'by hand'.

Martin

> Michael
> 
> On Wed, Sep 2, 2009 at 11:19 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
> 
>     Ivan Gregoretti wrote:
>     > Thank you very much everybody. It works great.
>     >
>     > It may be trivial for most list subscribers but for the record I'd
>     > like to describe the whole mini-workflow. Notice that one command is
>     > optional.
> 
>     Thanks Ivan for posting this, it's really helpful to see these work
>     flows in action. One small change to my %in% method, for the record
> 
>     >
>     > Ivan
>     >
>     > suppressMessages(library(ShortRead))
>     > suppressMessages(library(rtracklayer))
>     >
>     > setMethod("%in%", c("AlignedRead", "RangesList"),
>     >          function(x, table)
>     > {
>     >    ## consider only sensible alignemnts
>     >    chr <- chromosome(x)
>     >    pos <- position(x)
>     >    wd <- width(x)
>     >    notNA <- !(is.na <http://is.na>(chr) | is.na
>     <http://is.na>(pos) | is.na <http://is.na>(wd))
>     >    ## find overlap
>     >    chr <- chr[notNA]
>     >    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>     >                            width=split(wd[notNA], chr)))
> 
>     Patrick suggested
> 
>         rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr)
> 
>     as a cleaner implementation of this step.
> 
>     >    olap <- rl %in% table
>     >    ## map to original indicies
>     >    len <- seq_len(length(x))
>     >    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>     >    len %in% idx[unlist(olap)]
>     > })
>     >
>     > # load tags
>     > dirPath <- '/my/dir/'
>     > pattern <- 's_1_export.txt'
>     > aln <- readAligned(dirPath, pattern, 'SolexaExport')
>     >
>     > # load list of loci
>     > myRegions <- import('myregions.bed')
>     > # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
>     > names(myRegions) <- paste(names(myRegions), '.fa', sep='')
>     > myRegionsRL <- ranges(myRegions)
>     >
>     > # segregate by in/out membership
>     > alnIn <- aln[aln %in% myRegionsRL]
>     > alnOut <- aln[!(aln %in% myRegionsRL)]
>     >
>     >
>     >> sessionInfo()
>     > R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
>     > x86_64-unknown-linux-gnu
>     >
>     > locale:
>     >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>     >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>     >  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>     >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>     >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>     >
>     > attached base packages:
>     > [1] stats     graphics  grDevices utils     datasets  methods   base
>     >
>     > other attached packages:
>     > [1] rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1    
>     ShortRead_1.3.27
>     > [5] lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.34
>     IRanges_1.3.60
>     >
>     > loaded via a namespace (and not attached):
>     > [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
>     >
>     >
>     > Ivan Gregoretti, PhD
>     > National Institute of Diabetes and Digestive and Kidney Diseases
>     > National Institutes of Health
>     > 5 Memorial Dr, Building 5, Room 205.
>     > Bethesda, MD 20892. USA.
>     > Phone: 1-301-496-1592
>     > Fax: 1-301-496-9878
>     >
>     >
>     >
>     > On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org
>     <mailto:mtmorgan at fhcrc.org>> wrote:
>     >> Ivan Gregoretti wrote:
>     >>> Hello Steve, Martin and Everybody,
>     >>>
>     >>> First, thank you both for your help.
>     >>>
>     >>> As suggested by Steve, I look at
>     >>>
>     >>>
>     https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>     >>>
>     >>> by it didn't quite applied to my case. I need to iterate over
>     >>> chromosomes and I need to keep strand information.
>     >>>
>     >>> I'll try this new algorithm Martin just wrote. By the way, Martin,
>     >>> what kind of object should rangesList be?
>     >> an instance of the RangesList class
>     >>
>     >>> Say that I load my regions like this
>     >>>
>     >>> myRegions <- import('myregions.bed')
>     >>>
>     >>> Can I coerce that IRanges instance myRegions to the rangestList
>     class? How?
>     >> I think ranges(myRegions).
>     >>
>     >> I've forgotten what your sessionInfo() is; mine is
>     >>
>     >>> sessionInfo()
>     >> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
>     >> x86_64-unknown-linux-gnu
>     >>
>     >> locale:
>     >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>     >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>     >>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>     >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>     >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>     >>
>     >> attached base packages:
>     >> [1] stats     graphics  grDevices utils     datasets  methods   base
>     >>
>     >> other attached packages:
>     >> [1] ShortRead_1.3.29   lattice_0.17-25    BSgenome_1.13.10
>     >> Biostrings_2.13.34
>     >> [5] IRanges_1.3.65     rtracklayer_1.5.12 RCurl_1.1-0      
>      bitops_1.0-4.1
>     >>
>     >> loaded via a namespace (and not attached):
>     >> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
>     >>
>     >>
>     >> Martin
>     >>
>     >>> Thank you!
>     >>>
>     >>> Ivan
>     >>>
>     >>>
>     >>> Ivan Gregoretti, PhD
>     >>> National Institute of Diabetes and Digestive and Kidney Diseases
>     >>> National Institutes of Health
>     >>> 5 Memorial Dr, Building 5, Room 205.
>     >>> Bethesda, MD 20892. USA.
>     >>> Phone: 1-301-496-1592
>     >>> Fax: 1-301-496-9878
>     >>>
>     >>>
>     >>>
>     >>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org
>     <mailto:mtmorgan at fhcrc.org>> wrote:
>     >>>> Hi Ivan --
>     >>>>
>     >>>> Steve Lianoglou wrote:
>     >>>>> Hi Ivan,
>     >>>>>
>     >>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
>     >>>>>
>     >>>>>> Hello Everybody,
>     >>>>>>
>     >>>>>> How do you subset an AlignedRead instance to keep (or reject)
>     tags
>     >>>>>> that lay within a set of genomic regions?
>     >>>>>>
>     >>>>>>
>     >>>>>> Example
>     >>>>>>
>     >>>>>> Lets say that I have an AlignedRead instance called aln.
>     >>>>>>
>     >>>>>> Now let's say that I have a set of positions in BED style:
>     >>>>>>
>     >>>>>> (chromosome, start end)
>     >>>>>> ch1 1000000 1000050
>     >>>>>> chrX 20000000 20100000
>     >>>>>> ...(many more)...
>     >>>>>>
>     >>>>>> We can imagine that I have the BED set loaded as a data frame.
>     >>>>>>
>     >>>>>> Is it possible to pick from aln only the tags within (or
>     outside) the
>     >>>>>> features defined in the table described above?
>     >>>>> I think that you should convert your BED file to an IRanges
>     object, and
>     >>>>> use overlap with your ranges + your readAligned object to get
>     what your
>     >>>> I wrote this, as a trial implementation of %in%. Is that
>     useful? (also
>     >>>> !(... %in% ...) though that would, e.g., include NAs.
>     >>>>
>     >>>>
>     >>>> setMethod("%in%", c("AlignedRead", "RangesList"),
>     >>>>          function(x, table)
>     >>>> {
>     >>>>    ## consider only sensible alignemnts
>     >>>>    chr <- chromosome(x)
>     >>>>    pos <- position(x)
>     >>>>    wd <- width(x)
>     >>>>    notNA <- !(is.na <http://is.na>(chr) | is.na
>     <http://is.na>(pos) | is.na <http://is.na>(wd))
>     >>>>    ## find overlap
>     >>>>    chr <- chr[notNA]
>     >>>>    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>     >>>>                            width=split(wd[notNA], chr)))
>     >>>>    olap <- rl %in% table
>     >>>>    ## map to original indicies
>     >>>>    len <- seq_len(length(x))
>     >>>>    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>     >>>>    len %in% idx[unlist(olap)]
>     >>>> })
>     >>>>
>     >>>> One would then
>     >>>>
>     >>>>  aln[aln %in% rangesList]
>     >>>>
>     >>>> Martin
>     >>>>
>     >>>>> after. See Martin's post about something like this in this thread:
>     >>>>>
>     >>>>>
>     https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>     >>>>>
>     >>>>> To get the reads *outside* of your ranges, maybe you can call the
>     >>>>> ``gaps`` on your bed/ranges and then do the same thing ...  or
>     perhaps
>     >>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your
>     IRanges
>     >>>>> converted alignedRead object (if necessary)).
>     >>>>>
>     >>>>> -steve
>     >>>>>
>     >>>>> --
>     >>>>> Steve Lianoglou
>     >>>>> Graduate Student: Computational Systems Biology
>     >>>>>   |  Memorial Sloan-Kettering Cancer Center
>     >>>>>   |  Weill Medical College of Cornell University
>     >>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>     <http://cbio.mskcc.org/%7Elianos/contact>
>     >>>>>
>     >>>>> _______________________________________________
>     >>>>> Bioc-sig-sequencing mailing list
>     >>>>> Bioc-sig-sequencing at r-project.org
>     <mailto:Bioc-sig-sequencing at r-project.org>
>     >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>     >>> _______________________________________________
>     >>> Bioc-sig-sequencing mailing list
>     >>> Bioc-sig-sequencing at r-project.org
>     <mailto:Bioc-sig-sequencing at r-project.org>
>     >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>     >>
>     >
>     > _______________________________________________
>     > Bioc-sig-sequencing mailing list
>     > Bioc-sig-sequencing at r-project.org
>     <mailto:Bioc-sig-sequencing at r-project.org>
>     > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
>     _______________________________________________
>     Bioc-sig-sequencing mailing list
>     Bioc-sig-sequencing at r-project.org
>     <mailto:Bioc-sig-sequencing at r-project.org>
>     https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
>