[Bioc-sig-seq] ExpressionSet alikes for next-gen data

Fri Apr 2 18:55:54 CEST 2010

On 04/02/2010 08:57 AM, Vincent Carey wrote:
> my unfiltered reaction is to keep it in chipseq -- it would be nice for
> GenomicRanges to become quite stable and highly generic.  some subclassing
> of GRanges will doubtless go on, but when the target use case is ChIP-seq
> analysis, the fact that chipseq has some analysis tools should not prevent
> it from being the incubator for more general structure designs that do not
> address these specific analysis approaches.
> 
> if we find that this inhibits reuse we can take some other approach.  with
> relatively mature focused resource importation facilities now available
> there should be no inhibition.

Not sure where to insert my 2 cents into this thread, but wanted to note
that ExpressionSet doesn't really provide much guidance about what goes
into phenoData or featureData -- each is a blank slate for the user to
populate at will. This seems to have worked well enough; it is flexible,
and there has not been a proliferation of classes for the annotation of
samples or features for the user or developer to master.
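
As a rough illustration of that point (a minimal sketch, not code from
the thread), the following builds an ExpressionSet from two of the read
counts quoted below. The column names 'group', 'tag', 'seqname' and
'anything' are hypothetical; Biobase does not prescribe them, which is
the flexibility being described.

```r
library(Biobase)

# Counts for two intervals and two samples, taken from the
# tabulateReads() output quoted later in this thread.
counts <- matrix(c(3673, 2692, 3770, 2650), nrow = 2,
                 dimnames = list(c("intv1", "intv2"),
                                 c("isowt.5", "isowt.6")))

# Sample annotation: column names are hypothetical, chosen by the user.
pheno <- data.frame(group = c("isowt", "isowt"),
                    tag = c("5", "6"),
                    row.names = colnames(counts))

# Feature annotation: again free-form; here the interval coordinates.
feat <- data.frame(seqname = "Scchr13",
                   start = c(861250L, 863000L),
                   end = c(862750L, 864000L),
                   row.names = rownames(counts))

eset <- ExpressionSet(assayData = counts,
                      phenoData = AnnotatedDataFrame(pheno),
                      featureData = AnnotatedDataFrame(feat))

# Any further sample-level column can be added without a new class.
eset$anything <- c("x", "y")

varLabels(eset)   # sample annotation columns
fvarLabels(eset)  # feature annotation columns
```

The only constraints enforced are that the row names of the sample
annotation match the column names of the assay matrix, and that the row
names of the feature annotation match its row names.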

Martin

> 
> On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
> 
>> I've recently taken over the maintenance/development of the chipseq package
>> and have plans for a lot of refactoring, including some new formal classes
>> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
>> given that it also includes some specific analytical methods. That's not a
>> huge deal, but might GenomicRanges be the place for these high-level
>> structures?
>>
>>
>> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>
>>>
>>>
>>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <
>>>> stvjc at channing.harvard.edu> wrote:
>>>>
>>>>> To get a bit more concrete regarding these notions, the leeBamViews
>>>>> package is in the experimental data archive, a VERY rudimentary illustration
>>>>> of a workflow rooted in BAM archive files through region specification and
>>>>> read counting.  For the very latest checkin, after running
>>>>>
>>>>> example(bs1)
>>>>>
>>>>> we have an ad hoc tabulation of read counts:
>>>>>
>>>>> bs1> tabulateReads(bs1, "+")
>>>>>          intv1  intv2
>>>>> start   861250 863000
>>>>> end     862750 864000
>>>>> isowt.5   3673   2692
>>>>> isowt.6   3770   2650
>>>>> rlp.5     1532   1045
>>>>> rlp.6     1567   1139
>>>>> ssr.1     4304   3052
>>>>> ssr.2     4627   3381
>>>>> xrn.1     2841   1693
>>>>> xrn.2     3477   2197
>>>>>
>>>>> or, by setting as.GRanges=TRUE, a GRanges-based representation:
>>>>>
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>>>> GRanges with 2 ranges and 9 elementMetadata values
>>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>>>> [1]      1532      1567      4304      4627      2841      3477
>>>>> [2]      1045      1139      3052      3381      1693      2197
>>>>>
>>>>> seqlengths
>>>>> Scchr13
>>>>>      NA
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>>>> > metadata(OO)
>>>>> list()
>>>>>
>>>>> It seems that we would want more structure in a metadata component to
>>>>> get closer to the values of ExpressionSet discipline.  We would also want
>>>>> some accommodation of this kind of representation in downstream packages
>>>>> like edgeR and DESeq.
>>>>>
>>>>>
>>>> The actual 'metadata' slot was meant to be general, in order to
>>>> accommodate all needs. If a particular type of data requires a certain
>>>> structure, then additional formal classes may be necessary.  For example,
>>>> gene expression RNA-seq may want a featureData equivalent annotating each
>>>> transcript, whereas with ChIP-seq data, that sort of structure would make
>>>> less sense, short of some additional assumptions.
>>>>
>>>
>>> I agree completely.  Our task is to think/experiment about how to suitably
>>> specialize these structures for most effective downstream use.  Reuse by
>>> multiple downstream toolchains would be great.
>>>
>>>
>>
>>>> Michael
>>>>
>>>>> sessionInfo()
>>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>>>> x86_64-apple-darwin10.2.0
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>>>> [10] digest_0.4.1
>>>>>
>>>>>
>>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>>
>>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>>>>> On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
>>>>>>> david.rossell at irbbarcelona.org> wrote:
>>>>>>>
>>>>>>>> Following a recent thread, I have also found it convenient to store
>>>>>>>> nextgen data as RangedData instead of ShortRead objects. They require
>>>>>>>> far less memory and make it feasible to work with several samples at
>>>>>>>> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects
>>>>>>>> at the most; with RangedData I haven't hit the upper limit yet).
>>>>>>>>
>>>>>>>> I am thinking about taking this idea a step further: RangedDataList
>>>>>>>> allows storing info from several samples (e.g. IP and control) in a
>>>>>>>> single object. The only problem is that RangedDataList does not store
>>>>>>>> information about the samples, e.g. the phenoData we're used to in
>>>>>>>> ExpressionSet objects. My idea is to define something like a
>>>>>>>> "SequenceSet" class, which would contain a RangedDataList with the
>>>>>>>> ranges, a phenoData with sample information, and possibly also
>>>>>>>> information about the experiment (e.g. with the MIAME analog for
>>>>>>>> sequencing, MIASEQE).
>>>>>>>>
>>>>>>>> The thing is I don't want to re-invent the wheel. I haven't seen that
>>>>>>>> this is implemented yet, but is someone working on it? Any
>>>>>>>> criticism/ideas?
>>>>>>>>
>>>>>>>>
>>>>>>> RangedDataList already supports this. See the 'elementMetadata' and
>>>>>>> 'metadata' slots in the Sequence class.
>>>>>>
>>>>>> Hi David et al.,
>>>>>>
>>>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>>>> The ShortRead data objects retain sequence and quality information;
>>>>>> this information is often not needed after a certain point in the
>>>>>> analysis.
>>>>>>
>>>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>>>> GRanges class that is more fastidious about strand information (maybe a
>>>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>>>> view. Also the GappedAlignments class for efficiently representing large
>>>>>> numbers of reads.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Rossell, PhD
>>>>>>>> Manager, Bioinformatics and Biostatistics unit
>>>>>>>> IRB Barcelona
>>>>>>>> Tel (+34) 93 402 0217
>>>>>>>> Fax (+34) 93 402 0257
>>>>>>>> http://www.irbbarcelona.org/bioinformatics
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Martin Morgan
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793