[Bioc-sig-seq] (no subject) Changed: Overall directions
Mark W Kimpel
mwkimpel at gmail.com
Tue Mar 4 00:30:47 CET 2008
As a potential end-user who is just beginning to use a large Linux
cluster and explore Rmpi, I would lobby for structures and algorithms
that rely more heavily on parallel processing and not on massive amounts
of RAM. The systems that I have available consist of nodes with 4
processors which have to share 8 GB of RAM. OTOH, I can request up to 40
nodes for one batch.
It would be interesting to hear what compute platforms other end users
have available and, from the developers, how feasible it is to
parallelize the process as much as possible.
I'm not an expert, just very interested in this as one of my close
colleagues is preparing a grant which will require these types of analyses.
Mark
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
Martin Morgan wrote:
> Hi Stephen -- putting this back 'on-list' so everyone can participate;
> sorry if this is not as intended...
>
>> By front end I really mean R wrappers, not GUIs. It sounds like a
>> great idea to be able to do as much of my work as possible from
>> within R. From what I have seen of SOAP, the minimum memory that
>> programs need for alignment of Illumina data is ca. 16 GB of RAM.
>> The alignment problem is very threadable, but whilst this may speed
>> processing I do not think it reduces the memory requirement, as the
>> lookup tables have to be stored as a single entity (as far as I
>> understand).
>
>> http://soap.genomics.org.cn/
>
>> Incidentally SOAP alignment looks like it might integrate easily with R and is
>> competitive for time and alignment with MAQ and Eland.
>
> Yes, developers should keep the data structure in mind, and the
> opportunities for partitioning tasks. It seems like many operations
> (though perhaps not alignments) can treat reads as independent of one
> another, and this represents a natural way of dividing big tasks into
> more memory-efficient and distributed ones. Chromosomal structure also
> provides a natural way to think of how operations might be distributed
> across processors.
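A minimal sketch of the read-partitioning idea above, in Python for illustration only (the packages under discussion are R and C). The names `chunked`, `count_bases`, and `summarize_reads`, and the chunk size, are hypothetical. It assumes each read can be summarized independently of the others, so only one chunk is held in memory at a time and the per-chunk function could be handed to a parallel map instead of the built-in `map`.

```python
from collections import Counter
from itertools import islice


def chunked(reads, size):
    """Yield successive lists of at most `size` reads from any iterable."""
    it = iter(reads)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_bases(chunk):
    """Per-chunk work: tally base composition for one block of reads."""
    counts = Counter()
    for read in chunk:
        counts.update(read)  # adds one count per character in the read
    return counts


def summarize_reads(reads, size=100_000, mapper=map):
    """Combine per-chunk tallies into one whole-run tally.

    `mapper` defaults to the serial built-in map; a parallel map with the
    same calling convention could be substituted (an assumption here).
    """
    total = Counter()
    for partial in mapper(count_bases, chunked(reads, size)):
        total.update(partial)  # Counter.update adds counts together
    return total


reads = ["ACGT", "AACC", "GGTT", "ACGN"]
totals = summarize_reads(reads, size=2)
```

The whole-run result is built from small per-chunk summaries, which is the kind of "very big to moderately big" reduction that keeps the memory footprint bounded by the chunk size rather than by the run size.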
>
> A basic transformation seems to be from really very big data
> (sequences) to just moderately big (e.g., alignments and scores;
> apparent SNPs; ...).
>
>> An 'expressionSet like' object of pre-aligned or unassembled data which was
>> stored in R would be a list of length ca. 40 million with strings of ca. 25-30
>> characters (single end). An output of alignment from SOAP would be a table of
>> about 11 columns (with original seq, QC, chr, positions, flags etc.), and a bit
>> less than 40 million rows (if not all reads are aligned).
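The sizes quoted above can be turned into a rough lower bound. The sketch below (Python, for illustration) counts only the raw characters at one byte per base. Any per-string overhead that R or another runtime adds is left as an explicit parameter, `overhead_bytes`, a hypothetical name whose value is not given in this thread.

```python
def read_storage_bytes(n_reads, read_length, overhead_bytes=0):
    """Bytes needed to hold n_reads strings of read_length one-byte characters.

    overhead_bytes is a per-string bookkeeping cost; it is an assumption
    supplied by the caller, not a measured figure.
    """
    return n_reads * (read_length + overhead_bytes)


n_reads = 40_000_000                      # ca. 40 million reads
low = read_storage_bytes(n_reads, 25)     # shortest quoted read length
high = read_storage_bytes(n_reads, 30)    # longest quoted read length
print(f"raw characters only: {low / 1e9:.1f} to {high / 1e9:.1f} GB")
```

So the characters alone come to roughly 1 GB, before quality scores, the other alignment columns, or any per-string overhead are counted.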
>
>> Is the 'expressionSet like' object really going to store this? Or
>> will it be a reference to a database or external file? I guess as
>> you say you don't necessarily need to store all of this but can
>> sample a lot of it for QC, plotting and analysis?
>
> The main conceptual insights of the ExpressionSet are an association
> of phenotype with 'data', and the abstraction of how the data is
> represented internally from how the user interacts with it.
>
> We've taken different approaches in our preliminary work (comments
> from other developers on how they are dealing with these issues are
> most welcome!). For QA types of operations it turns out to be fairly
> effective to visit relatively small files (e.g., Solexa lanes or
> tiles) and summarize these into useful statistics for further
> manipulation (e.g., reports and visualization) at the whole-run
> level. For some exploratory alignment algorithms (see matchPDict in
> the development version of the Biostrings package) that require more
> structured data representations, the approach is more
> straightforward: representing the data requires large memory
> machines. Even here though there are some nuances, e.g., processing
> each chromosome separately.
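One way to picture the "each chromosome separately" nuance is sketched below in Python (illustration only; it is not the alignment algorithm mentioned above). It assumes alignment records arrive as (chromosome, position) pairs already sorted by chromosome, which is a hypothetical input format, and `per_chromosome` and `summarize` are made-up names.

```python
from itertools import groupby
from operator import itemgetter


def per_chromosome(records):
    """Yield (chromosome, positions) one chromosome at a time.

    Assumes `records` is an iterable of (chromosome, position) tuples that
    is already sorted by chromosome, so only one chromosome's positions are
    materialized at once.
    """
    for chrom, group in groupby(records, key=itemgetter(0)):
        yield chrom, [pos for _, pos in group]


def summarize(records):
    """Reduce each chromosome to a small summary: hit count and span."""
    return {
        chrom: {"n": len(pos), "min": min(pos), "max": max(pos)}
        for chrom, pos in per_chromosome(records)
    }


records = [("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900), ("chr2", 75)]
summary = summarize(records)
```

If the input is not sorted by chromosome, `groupby` would emit the same chromosome more than once, so the sort order is a real precondition of this sketch.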
>
> Maybe a closing thought on this is that the data describing the
> experiment might belong in SQL tables (but also fit easily into R's
> memory), but it's less clear that the sequences belong in a relational
> database. So some other format is likely appropriate for the big
> data. Here we've basically been using the disk-based storage
> structures implied by output of the Solexa (or other) software
> pipeline. Obviously a sub-optimal solution, and it would be great to
> hear about solutions that other developers have explored.
>
> Martin
>
> ------------------------------------------------------------------------------
>
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Fri 29/02/2008 17:23
> To: Stephen Henderson
> Cc: bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] (no subject) Changed: Overall directions
>
>
> "Stephen Henderson" <s.henderson at ucl.ac.uk> writes:
>> OK
>>
>> Perhaps I can be first by asking what tasks you plan to cover? And how
>> do you plan to implement them in R (given the memory restrictions)? Do
>> you plan a nice front end for lots of C-code?
> Hi Stephen --
> It'll depend of course on who in the community steps up. Probably
> packages will start as a standard R interface that gets the job done,
> with pretty GUIs later. Probably an early step (though perhaps not
> the very first) will be settling on a common set of S4-style classes
> to represent experiments and data, in the manner of an ExpressionSet.
> From our end, our first pass is to assume that computer resources are
> not really an issue -- a 2 or 4 GB 32-bit operating system is not what
> we're targeting.
> Also in terms of preliminary experience, it seems like some operations
> can be done effectively at the R level (data input and QA assessment)
> but that some important steps (e.g., alignments) require clever data
> structures and algorithms that get implemented in C. It's also
> possible for some questions to exploit the structure of the data,
> e.g., analyzing Solexa data in manageable chunks corresponding to
> individual tiles.
> Martin
>> Stephen Henderson
>>
>> Cancer Institute, Paul O'Gorman Building
>>
>> Gower Street, University College London
>>
>> United Kingdom, WC1E 6BT
>>
>> +44 (0)207 679 6827
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing