[Bioc-sig-seq] Using SVN for a "data package" with HTS data -- howto???
Leonardo Collado Torres
lcollado at lcg.unam.mx
Wed Feb 17 02:55:13 CET 2010
Hello everyone,
How are you doing? I hope that everything is working out great for you ^_^.
Anyhow, I'm emailing you because I have a Subversion / R / HTS related
question. A few of us (4 right now) in my lab want to analyze some
Illumina GAIIx data and the idea is to use R as the backbone. We want to
keep all the results in .Rdata format and kind of build an "internal"
package so that the biologists at the lab could then load the tables
easily. Kind of what Patrick Aboyoun told us at BioC2009. So, we want:
A) Major Script
This one will call the individual scripts, each of which performs one
step of the workflow. It will help us remember what we did and in what
order.
Actually, a .Rnw vignette file would be much better.
B) Individual Scripts
These will have code but no function definitions. For example, in one of
these you could call the "aligner" through a function, then read the
results, compute the read coverage per base, and make a plot. Think of
them as analysis modules.
C) Package
Here we'll define all the functions that will be called by the
"individual" scripts, along with examples, documentation for the
functions, etc. We'll also save the results from the scripts as R
objects, most likely data frames. Some might be large (10 MB?).
The Illumina data and some big files like the alignments will not be
kept in the package.
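To make A), B) and C) more concrete, here is a rough sketch of how such a package could be laid out on disk. The package name "labHTSdata", the inst/scripts/ location for the individual scripts, and every value in the DESCRIPTION file are placeholders I made up; only the R/, man/, data/ and inst/doc/ directories are standard R package conventions.

```shell
#!/bin/sh
# Sketch only: "labHTSdata" is a hypothetical package name, created in a
# throwaway directory so nothing existing is touched.
PKG_ROOT="$(mktemp -d)"
PKG="$PKG_ROOT/labHTSdata"

# R/            function definitions called by the individual scripts
# man/          documentation for those functions
# data/         results saved as R objects (e.g. data frames in .RData files)
# inst/doc/     the "major script", written as an .Rnw vignette
# inst/scripts/ the individual scripts (a layout choice, not an R requirement)
mkdir -p "$PKG/R" "$PKG/man" "$PKG/data" "$PKG/inst/doc" "$PKG/inst/scripts"

# Skeleton DESCRIPTION file. All values are placeholders; a real package
# also needs Author, Maintainer and License fields.
cat > "$PKG/DESCRIPTION" <<'EOF'
Package: labHTSdata
Version: 0.0.1
Title: Internal HTS results for the lab
Description: Functions, scripts and result tables for our Illumina data.
EOF

# Show the resulting tree.
ls -R "$PKG"
```

The raw Illumina files and the alignments would stay outside this tree, so only the scripts, functions and the smaller result objects would end up under version control.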
The idea is that someone or a small team will develop individual
scripts, but the package and the major script will be edited by everyone
participating. Now, I think that using Subversion is the way to go.
However, I'm puzzled at what SVN hosting service we should use... We are
not building open source software; it's more like a data package -- VJ
Carey talked about them at BioC2009. Eventually it would be great to
share the package, but for some months it will all be a work in progress
meant to be seen only by those in the lab/project. In the worst case,
the package would never make it out of the lab.
I don't know of a public SVN hosting service that meets our
needs. I guess that we could use Google Code or Rforge (just to mention
a few) and not distribute the URL for those "lab-only" months, although
anyone could still stumble upon it. Or should we pay for one of the
commercial SVN hosting services to keep the work private? (check
http://www.svnhostingcomparison.com/ ) Hosting it on a local server is a
problem for us since the server administrators are quite restrictive and
svn checkouts/commits would most likely be blocked. They've had bad luck
with external attacks on the servers.
Otherwise, I think everyone involved could share a single user account
on the server and use SVN only there. Something very similar to using SVN
on your laptop with 2 directories: the checkout one and the "repository"
one (check
http://www.guyrutenberg.com/2007/10/29/creating-local-svn-repository-home-repository/
).
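A minimal sketch of that two-directory setup, assuming svn and svnadmin are installed on the server. The paths, the script name and the commit message are placeholders; the temporary directory is there only so the example can be run without touching anything real.

```shell
#!/bin/sh
set -e

# Placeholder locations: one directory holds the repository itself,
# the other is the working copy ("checkout") where files are edited.
WORK="$(mktemp -d)"
REPO="$WORK/repo"
CHECKOUT="$WORK/checkout"

# Create an empty repository on the local filesystem.
svnadmin create "$REPO"

# Check it out through a file:// URL, so no network access is involved.
svn checkout "file://$REPO" "$CHECKOUT"

# Add a (hypothetical) analysis script and commit it.
cd "$CHECKOUT"
mkdir scripts
echo '# step 1: read coverage per base' > scripts/coverage.R
svn add scripts
svn commit -m "Add first analysis script"

# The history lives in the repository, not in the working copy.
svn log "file://$REPO"
```

One thing to keep in mind with a single shared server user: by default every commit would be recorded under that same author name, so the log would not show who in the lab made which change.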
As you can tell, I'm quite the newbie at SVN and at working
collaboratively with Illumina GA data. Any tips are more than welcome :)
I also asked on SEQanswers:
http://seqanswers.com/forums/showthread.php?t=4071
Thank you and greetings,
Leonardo
--
Leonardo Collado Torres, Bachelor in Genomic Sciences
Member of Dr. Enrique Morett's lab and Winter Genomics
UNAM Campus Cuernavaca, Mexico
Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05