[Bioc-sig-seq] AlignedRead and complex subsetting
Ivan Gregoretti
ivangreg at gmail.com
Wed Sep 2 19:48:27 CEST 2009
Thank you very much everybody. It works great.
It may be trivial for most list subscribers but for the record I'd
like to describe the whole mini-workflow. Notice that one command is
optional.
Ivan
suppressMessages(library(ShortRead))
suppressMessages(library(rtracklayer))
setMethod("%in%", c("AlignedRead", "RangesList"),
function(x, table)
{
## consider only sensible alignemnts
chr <- chromosome(x)
pos <- position(x)
wd <- width(x)
notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
## find overlap
chr <- chr[notNA]
rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
width=split(wd[notNA], chr)))
olap <- rl %in% table
## map to original indicies
len <- seq_len(length(x))
idx <- unlist(split(len[notNA], chr), use.names=FALSE)
len %in% idx[unlist(olap)]
})
# load tags
dirPath <- '/my/dir/'
pattern <- 's_1_export.txt'
aln <- readAligned(dirPath, pattern, 'SolexaExport')
# load list of loci
myRegions <- import('myregions.bed')
# OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
names(myRegions) <- paste(names(myRegions), '.fa', sep='')
myRegionsRL <- ranges(myRegions)
# segregate by in/out membership
alnIn <- aln[aln %in% myRegionsRL]
alnOut <- aln[!(aln %in% myRegionsRL)]
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1 ShortRead_1.3.27
[5] lattice_0.17-25 BSgenome_1.13.10 Biostrings_2.13.34 IRanges_1.3.60
loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
> Ivan Gregoretti wrote:
>> Hello Steve, Martin and Everybody,
>>
>> First, thank you both for your help.
>>
>> As suggested by Steve, I look at
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>
>> by it didn't quite applied to my case. I need to iterate over
>> chromosomes and I need to keep strand information.
>>
>> I'll try this new algorithm Martin just wrote. By the way, Martin,
>> what kind of object should rangesList be?
>
> an instance of the RangesList class
>
>> Say that I load my regions like this
>>
>> myRegions <- import('myregions.bed')
>>
>> Can I coerce that IRanges instance myRegions to the rangestList class? How?
>
> I think ranges(myRegions).
>
> I've forgotten what your sessionInfo() is; mine is
>
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.3.29 lattice_0.17-25 BSgenome_1.13.10
> Biostrings_2.13.34
> [5] IRanges_1.3.65 rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
>
>
> Martin
>
>> Thank you!
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>>> Hi Ivan --
>>>
>>> Steve Lianoglou wrote:
>>>> Hi Ivan,
>>>>
>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
>>>>
>>>>> Hello Everybody,
>>>>>
>>>>> How do you subset an AlignedRead instance to keep (or reject) tags
>>>>> that lay within a set of genomic regions?
>>>>>
>>>>>
>>>>> Example
>>>>>
>>>>> Lets say that I have an AlignedRead instance called aln.
>>>>>
>>>>> Now let's say that I have a set of positions in BED style:
>>>>>
>>>>> (chromosome, start end)
>>>>> ch1 1000000 1000050
>>>>> chrX 20000000 20100000
>>>>> ...(many more)...
>>>>>
>>>>> We can imagine that I have the BED set loaded as a data frame.
>>>>>
>>>>> Is it possible to pick from aln only the tags within (or outside) the
>>>>> features defined in the table described above?
>>>> I think that you should convert your BED file to an IRanges object, and
>>>> use overlap with your ranges + your readAligned object to get what your
>>> I wrote this, as a trial implementation of %in%. Is that useful? (also
>>> !(... %in% ...) though that would, e.g., include NAs.
>>>
>>>
>>> setMethod("%in%", c("AlignedRead", "RangesList"),
>>> function(x, table)
>>> {
>>> ## consider only sensible alignemnts
>>> chr <- chromosome(x)
>>> pos <- position(x)
>>> wd <- width(x)
>>> notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
>>> ## find overlap
>>> chr <- chr[notNA]
>>> rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>>> width=split(wd[notNA], chr)))
>>> olap <- rl %in% table
>>> ## map to original indicies
>>> len <- seq_len(length(x))
>>> idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>>> len %in% idx[unlist(olap)]
>>> })
>>>
>>> One would then
>>>
>>> aln[aln %in% rangesList]
>>>
>>> Martin
>>>
>>>> after. See Martin's post about something like this in this thread:
>>>>
>>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>>
>>>> To get the reads *outside* of your ranges, maybe you can call the
>>>> ``gaps`` on your bed/ranges and then do the same thing ... or perhaps
>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges
>>>> converted alignedRead object (if necessary)).
>>>>
>>>> -steve
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>> | Memorial Sloan-Kettering Cancer Center
>>>> | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
More information about the Bioc-sig-sequencing
mailing list