[Bioc-sig-seq] ExpressionSet alikes for next-gen data
Martin Morgan
mtmorgan at fhcrc.org
Fri Apr 2 18:55:54 CEST 2010
On 04/02/2010 08:57 AM, Vincent Carey wrote:
> my unfiltered reaction is to keep it in chipseq -- it would be nice for
> GenomicRanges to become quite stable and highly generic. some subclassing
> of GRanges will doubtless go on, but when the target use case is ChIP-seq
> analysis, the fact that chipseq has some analysis tools should not prevent
> it from being the incubator for more general structure designs that do not
> address these specific analysis approaches.
>
> if we find that this inhibits reuse we can take some other approach. with
> relatively mature focused resource importation facilities now available
> there should be no inhibition.
Not sure where to insert my 2 cents into this thread, but I wanted to note
that ExpressionSet doesn't really provide much guidance about what goes
into phenoData or featureData -- each is a tabula rasa for the user to
populate at will. This seems to have worked well enough; it is flexible,
and there has not been a proliferation of classes for the annotation of
samples or features for the user or developer to master.
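
To make that concrete, here is a minimal sketch using Biobase. The counts
are copied from the tabulateReads() output quoted further down; the 'strain'
column and the use of start/end as feature annotation are illustrative
assumptions, not anything ExpressionSet requires.

```r
library(Biobase)

# Toy counts: 2 features (rows) x 4 samples (columns), filled column by column.
counts <- matrix(
    c(3673, 2692, 3770, 2650, 1532, 1045, 1567, 1139),
    nrow = 2,
    dimnames = list(c("intv1", "intv2"),
                    c("isowt.5", "isowt.6", "rlp.5", "rlp.6"))
)

# Sample annotation: the columns are entirely the user's choice.
# 'strain' is a hypothetical covariate made up for this sketch.
pheno <- data.frame(
    strain = c("isowt", "isowt", "rlp", "rlp"),
    row.names = colnames(counts)
)

# Feature annotation: likewise free-form.
feat <- data.frame(
    start = c(861250, 863000),
    end   = c(862750, 864000),
    row.names = rownames(counts)
)

eset <- ExpressionSet(
    assayData   = counts,
    phenoData   = AnnotatedDataFrame(pheno),
    featureData = AnnotatedDataFrame(feat)
)

pData(eset)  # sample table, one row per column of the assay matrix
fData(eset)  # feature table, one row per row of the assay matrix
```

The only constraint the class imposes here is that row names of the sample
table match the assay column names and row names of the feature table match
the assay row names; what the tables contain is left open.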
Martin
>
> On Fri, Apr 2, 2010 at 11:43 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
>
>> I've recently taken over the maintenance/development of the chipseq package
>> and have plans for a lot of refactoring, including some new formal classes
>> for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
>> given that it also includes some specific analytical methods. That's not a
>> huge deal, but might GenomicRanges be the place for these high-level
>> structures?
>>
>>
>> On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>
>>>
>>>
>>> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <
>>>> stvjc at channing.harvard.edu> wrote:
>>>>
>>>>> To get a bit more concrete regarding these notions, the leeBamViews
>>>>> package is in the experimental data archive, a VERY rudimentary illustration
>>>>> of a workflow rooted in BAM archive files through region specification and
>>>>> read counting. For the very latest checkin, after running
>>>>>
>>>>> example(bs1)
>>>>>
>>>>> we have an ad hoc tabulation of read counts:
>>>>>
>>>>> bs1> tabulateReads(bs1, "+")
>>>>>          intv1  intv2
>>>>> start   861250 863000
>>>>> end     862750 864000
>>>>> isowt.5   3673   2692
>>>>> isowt.6   3770   2650
>>>>> rlp.5     1532   1045
>>>>> rlp.6     1567   1139
>>>>> ssr.1     4304   3052
>>>>> ssr.2     4627   3381
>>>>> xrn.1     2841   1693
>>>>> xrn.2     3477   2197
>>>>>
>>>>> or, by setting as.GRanges=TRUE, a GRanges-based representation:
>>>>>
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>>>> GRanges with 2 ranges and 9 elementMetadata values
>>>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>>>> [1]      1532      1567      4304      4627      2841      3477
>>>>> [2]      1045      1139      3052      3381      1693      2197
>>>>>
>>>>> seqlengths
>>>>>  Scchr13
>>>>>       NA
>>>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>>>> > metadata(OO)
>>>>> list()
>>>>>
>>>>> It seems that we would want more structure in a metadata component to
>>>>> get closer to the values of ExpressionSet discipline. We would also want
>>>>> some accommodation of this kind of representation in the downstream packages
>>>>> like edgeR, DESeq.
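
One way the "more structure in a metadata component" idea might be sketched is
shown below. It assumes a current GenomicRanges release, where mcols() is the
accessor for what this thread calls elementMetadata. tabulateReads() is not
called; the ranges and two of the count columns are typed in from the output
above. The 'samples' entry in metadata() and its 'strain' column are
hypothetical conventions invented for this sketch, not part of any package.

```r
library(GenomicRanges)

# Ranges plus per-sample counts as metadata columns, mirroring the
# as.GRanges=TRUE output (only two of the eight samples, for brevity).
gr <- GRanges(
    seqnames = "Scchr13",
    ranges   = IRanges(start = c(861250, 863000), end = c(862750, 864000)),
    strand   = "+",
    name     = c("intv1", "intv2"),
    isowt.5  = c(3673L, 2692L),
    isowt.6  = c(3770L, 2650L)
)

# Hypothetical convention: keep a phenoData-like sample table in the
# otherwise unstructured metadata() list, keyed by sample name.
metadata(gr)$samples <- data.frame(
    strain    = c("isowt", "isowt"),
    row.names = c("isowt.5", "isowt.6")
)

# The sample table identifies which metadata columns hold counts, so a
# features-by-samples matrix can be extracted for downstream tools.
sample_ids <- rownames(metadata(gr)$samples)
counts <- as.matrix(mcols(gr)[, sample_ids])
rownames(counts) <- gr$name
counts
```

A plain integer matrix of this shape is the kind of input that count-based
packages accept (for example as the counts argument of edgeR's DGEList()),
though nothing here enforces that the sample table and the count columns stay
in sync; a formal class would be needed for that.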
>>>>>
>>>>>
>>>> The actual 'metadata' slot was meant to be general, in order to
>>>> accommodate all needs. If a particular type of data requires a certain
>>>> structure, then additional formal classes may be necessary. For example,
>>>> gene expression RNA-seq may want a featureData equivalent annotating each
>>>> transcript, whereas with ChIP-seq data, that sort of structure would make
>>>> less sense, short of some additional assumptions.
>>>>
>>>
>>> I agree completely. Our task is to think/experiment about how to suitably
>>> specialize these structures for most effective downstream use. Reuse by
>>> multiple downstream toolchains would be great.
>>>
>>>
>>
>>>> Michael
>>>>
>>>>> sessionInfo()
>>>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>>>> x86_64-apple-darwin10.2.0
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices datasets tools utils methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] leeBamViews_0.99.3 BSgenome_1.15.18 Rsamtools_0.2.1
>>>>> [4] Biostrings_2.15.25 GenomicRanges_0.1.3 IRanges_1.5.74
>>>>> [7] Biobase_2.7.5 weaver_1.13.0 codetools_0.2-2
>>>>> [10] digest_0.4.1
>>>>>
>>>>>
>>>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>>
>>>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>>>>> On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
>>>>>>> david.rossell at irbbarcelona.org> wrote:
>>>>>>>
>>>>>>>> Following a recent thread, I have also found it convenient to store
>>>>>>>> next-gen data as RangedData instead of ShortRead objects. They require
>>>>>>>> far less memory and make it feasible to work with several samples at
>>>>>>>> the same time (on my desktop with 8 GB of RAM I can load 2 ShortRead
>>>>>>>> objects at most; with RangedData I haven't hit the upper limit yet).
>>>>>>>>
>>>>>>>> I am thinking about taking this idea a step further: RangedDataList
>>>>>>>> allows storing info from several samples (e.g. IP and control) in a
>>>>>>>> single object. The only problem is that RangedDataList does not store
>>>>>>>> information about the samples, e.g. the phenoData we're used to in
>>>>>>>> ExpressionSet objects. My idea is to define something like a
>>>>>>>> "SequenceSet" class, which would contain a RangedDataList with the
>>>>>>>> ranges, a phenoData with sample information, and possibly also
>>>>>>>> information about the experiment (e.g. with MINSEQE, the MIAME analog
>>>>>>>> for sequencing).
>>>>>>>>
>>>>>>>> The thing is, I don't want to re-invent the wheel. I haven't seen this
>>>>>>>> implemented yet, but is someone working on it? Any criticism or ideas?
>>>>>>>>
>>>>>>>>
>>>>>>> RangedDataList already supports this. See the 'elementMetadata' and
>>>>>>> 'metadata' slots in the Sequence class.
>>>>>>
>>>>>> Hi David et al.,
>>>>>>
>>>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>>>> The ShortRead data objects retain sequence and quality information,
>>>>>> but this information is often not needed after a certain point in the
>>>>>> analysis.
>>>>>>
>>>>>> I wanted to point to the GenomicRanges package in Bioc-devel, which has
>>>>>> a GRanges class that is more fastidious about strand information (maybe
>>>>>> a plus?) and conforms more to an 'I am a rectangular data structure'
>>>>>> world view. It also has the GappedAlignments class for efficiently
>>>>>> representing large numbers of reads.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Rossell, PhD
>>>>>>>> Manager, Bioinformatics and Biostatistics unit
>>>>>>>> IRB Barcelona
>>>>>>>> Tel (+34) 93 402 0217
>>>>>>>> Fax (+34) 93 402 0257
>>>>>>>> http://www.irbbarcelona.org/bioinformatics
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Martin Morgan
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793