[R] Tools For Preparing Data For Analysis
    Wensui Liu 
    liuwensui at gmail.com
       
    Fri Jun  8 15:50:47 CEST 2007
    
    
  
I have mentioned exactly the same thing to others, and the feedback I got was:
'when you have a hammer, everything will look like a nail'
^_^.
On 6/7/07, Frank E Harrell Jr <f.harrell at vanderbilt.edu> wrote:
> Robert Wilkins wrote:
> > As noted on the R-project web site itself ( www.r-project.org ->
> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> > messy and dirty data for analysis with the R tool itself. I've also
> > seen at least one S programming book (one of the yellow Springer ones)
> > that says, more briefly, the same thing.
> > The R Data Import/Export page recommends examples using SAS, Perl,
> > Python, and Java. It takes a bit of courage to say that ( when you go
> > to a corporate software web site, you'll never see a page saying "This
> > is the type of problem that our product is not the best at, here's
> > what we suggest instead" ). I'd like to provide a few more
> > suggestions, especially for volunteers who are willing to evaluate new
> > candidates.
> >
> > SAS is fine if you're not paying for the license out of your own
> > pocket. But maybe one reason you're using R is you don't have
> > thousands of spare dollars.
> > Using Java for data cleaning is an exercise in sado-masochism: Java
> > has a learning curve (almost) as difficult as C++.
> >
> > There are different types of data transformation, and for some data
> > preparation problems an all-purpose programming language is a good
> > choice ( e.g. Perl , or maybe Python/Ruby ). Perl, for example, has
> > excellent regular expression facilities.
> >
> > However, for some types of complex demanding data preparation
> > problems, an all-purpose programming language is a poor choice. For
> > example: cleaning up and preparing clinical lab data and adverse event
> > data - you could do it in Perl, but it would take way, way too much
> > time. A specialized programming language is needed. And since data
> > transformation is quite different from data query, SQL is not the
> > ideal solution either.
>
> We deal with exactly those kinds of data solely using R.  R is
> exceptionally powerful for data manipulation, just a bit hard to learn.
> Many examples are at
> http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf
>
> Frank
>
> >
> > There are only three statistical programming languages that are
> > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> > popular than S for data cleaning.
> >
> > If you're an R user with difficult data preparation problems, frankly
> > you are out of luck, because the products I'm about to mention are
> > new, unknown, and therefore regarded as immature. And while the
> > founders of these products would be very happy if you kicked the
> > tires, most people don't like to look at brand new products. Most
> > innovators and inventors don't realize this; I've learned it the hard
> > way.
> >
> > But if you are a volunteer who likes to help out by evaluating,
> > comparing, and reporting upon new candidates, well you could certainly
> > help out R users and the developers of the products by kicking the
> > tires of these products. And there is a huge need for such volunteers.
> >
> > 1. DAP
> > This is an open source implementation of SAS.
> > The founder: Susan Bassein
> > Find it at: directory.fsf.org/math/stats (GNU GPL)
> >
> > 2. PSPP
> > This is an open source implementation of SPSS.
> > The relatively early version number might not give a good idea of how
> > mature the data transformation features are; it reflects the fact that
> > the founder has only started doing the statistical tests.
> > The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
> > Also at : directory.fsf.org/math/stats (GNU GPL)
> >
> > 3. Vilno
> > This uses a programming language similar to SPSS and SAS, but quite unlike S.
> > Essentially, it's a substitute for the SAS datastep, and also
> > transposes data and calculates averages and such. (No t-tests or
> > regressions in this version). I created this, during the years
> > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> > my opinion. The tarball includes about 100 or so test cases used for
> > debugging - for logical calculation errors, but not for extremely high
> > volumes of data.
> > The maintenance of Vilno has slowed down, because I am currently
> > (desperately) looking for employment. But once I've found new
> > employment and living quarters and settled in, I will continue to
> > enhance Vilno in my spare time.
> > The founder: that would be me, Robert Wilkins
> > Find it at: code.google.com/p/vilno ( GNU GPL )
> > ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> > , since I have yet to figure out how to use Subversion ).
> >
> >
> > 4. Who knows?
> > It was not easy to find out about the existence of DAP and PSPP. So
> > who knows what else is out there. However, I think you'll find a lot
> > more statistics software ( regression , etc ) out there, and not so
> > much data transformation software. Not many people work on data
> > preparation software. In fact, the category is so obscure that there
> > isn't one agreed term: data cleaning , data munging , data crunching ,
> > or just getting the data ready for analysis.
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                       Department of Biostatistics   Vanderbilt University
>
-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
    
    