[BioC] matrix like object with Rle columns
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Wed Jun 27 17:07:16 CEST 2012
One comment: since a matrix is a vector with a dim attribute, I see
that the natural parallel is to do the same for an Rle. Nevertheless,
that would put an upper limit on the total number of runs in the
entire matrix, because all columns would share a single Rle. My
impression (which could be wrong) is that we would need to implement
essentially all matrix-like numeric operations from scratch anyway, so
it may be worthwhile to consider using a list of Rles where each Rle
is a column, instead of a single Rle to represent all columns. Clearly
that depends on implementation details, but if we really need to do
everything from scratch, a list of columns might be more flexible (and
perhaps even easier to code).
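
To make the list-of-columns idea concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not the IRanges/Bioconductor
API, and the names rle_encode, col_sums and row_sums are hypothetical).
Each column is a pair of (values, lengths), and the sketch assumes at
least one column and that all columns have the same number of rows.
Column sums only need each column's own runs; row sums have to walk the
union of run boundaries across columns, which is why the row-oriented
methods are the harder ones.

```python
from itertools import accumulate


def rle_encode(xs):
    """Compress a sequence into (values, lengths) runs."""
    values, lengths = [], []
    for x in xs:
        if values and values[-1] == x:
            lengths[-1] += 1
        else:
            values.append(x)
            lengths.append(1)
    return values, lengths


def col_sums(columns):
    """Column sums: one pass over each column's runs, no expansion."""
    return [sum(v * n for v, n in zip(values, lengths))
            for values, lengths in columns]


def row_sums(columns):
    """Row sums, returned as a run-length encoded (values, lengths) pair.

    Walks the union of all columns' run boundaries, so the cost scales
    with the total number of runs rather than the number of rows.
    """
    n_rows = sum(columns[0][1])
    # Exclusive end position of every run, per column.
    ends = [list(accumulate(lengths)) for _, lengths in columns]
    idx = [0] * len(columns)  # current run index in each column
    out_values, out_lengths = [], []
    pos = 0
    while pos < n_rows:
        # Next row at which any column changes value.
        nxt = min(e[i] for e, i in zip(ends, idx))
        total = sum(col[0][i] for col, i in zip(columns, idx))
        if out_values and out_values[-1] == total:
            out_lengths[-1] += nxt - pos  # extend the current output run
        else:
            out_values.append(total)
            out_lengths.append(nxt - pos)
        # Advance every column whose current run ends at nxt.
        idx = [i + 1 if e[i] == nxt else i for e, i in zip(ends, idx)]
        pos = nxt
    return out_values, out_lengths


# Two "samples" (columns) over six "genomic positions" (rows).
columns = [rle_encode([0, 0, 0, 5, 5, 0]),
           rle_encode([1, 1, 2, 2, 2, 2])]
print(col_sums(columns))
print(row_sums(columns))
```

With one Rle per column, each column keeps its own run vectors, so no
single vector has to hold the runs of the whole matrix.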
Kasper
On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> Seems like it could be a nice thing to have. Presumably one would create an
> Array subclass of Vector that would add a "dim" attribute. Then Matrix could
> extend that to constrain dim to length two (unfortunately colliding with the
> Matrix class in the Matrix package). Then RleMatrix extends Matrix to
> implement the actual data storage and many of the accelerated methods. As
> you said, row-oriented methods would be tough.
>
> Any takers?
>
> Michael
>
> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>>
>> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com> wrote:
>> > On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence
>> > <lawrence.michael at gene.com> wrote:
>> >> Patrick and I had talked about this a long time ago (essentially
>> >> putting a "dim" attribute on an Rle), but the closest thing today
>> >> is a DataFrame with Rle columns.
>> >>
>> >> Use case?
>> >
>> > Say I have whole-genome data (for example coverage) on multiple
>> > samples. Usually, this is far easier to think of as a matrix (in my
>> > opinion) with ~3B rows and I often want to do rowSums(), colSums() etc
>> > (in fact, probably the whole API from matrixStats). This is
>> > especially nice when you have multiple coverage-like tracks on each
>> > sample, so you could have
>> > trackA : genome by samples
>> > trackB : genome by samples
>> > ...
>> >
>> > You could think of this as a SummarizedExperiment, but with
>> > _extremely_ big matrices in the assay slot.
>> >
>> > I want to take advantage of the Rle structure to store the data more
>> > efficiently and also to do potentially faster computations.
>> >
>> > This is actually closer to my use case where I currently use matrices
>> > with ~30M rows (which works fine), but I would like to expand to ~800M
>> > rows (which would suck a bit).
>> >
>> > You could also think of a matrix-like object with Rle columns as an
>> > alternative sparse matrix structure. In a typical sparse matrix you
>> > only store the non-zero entries; here we only store the
>> > change-points. Depending on the structure of the matrix this could be
>> > an efficient storage of an otherwise dense matrix.
>> >
>> > So essentially, what I want is to have mathematical operations on
>> > this object, where I would exploit the fact that all entries are
>> > numbers, so the typical matrix operations make sense.
>> >
>> > [ side question which could be relevant in this discussion: for a
>> > numeric Rle is there some notion of precision - say I have truly
>> > numeric values with tons of digits, and I want to consider two numbers
>> > part of the same run if |x1 - x2| < epsilon? ]
>>
>> You can see that Pete has had similar thoughts in
>> genoset/R/DataFrame-methods.R, although he only has colMeans (which is
>> the easy one).
>>
>> Kasper
>>
>> > Kasper
>> >
>> >>
>> >> Michael
>> >>
>> >> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>> >> <kasperdanielhansen at gmail.com> wrote:
>> >>>
>> >>> Do we have a matrix-like object, but where the columns are Rle's?
>> >>>
>> >>> Kasper
>> >>>
>> >>
>> >>
>
>