[R-pkgs] plyr: version 1.2
Hadley Wickham
hadley at rice.edu
Fri Sep 10 14:36:20 CEST 2010
plyr is a set of tools for a common set of problems: you need to
__split__ up a big data structure into homogeneous pieces, __apply__ a
function to each piece and then __combine__ all the results back
together. For example, you might want to:
* fit the same model to each patient subset of a data frame
* quickly calculate summary statistics for each group
* perform group-wise transformations like scaling or standardising
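As a rough illustration of the three tasks above, here is a minimal sketch using ddply and dlply. The data frame d and its columns (patient, x, y) are made up for this example and do not come from the package.

```r
library(plyr)

# Hypothetical toy data: two patients, three measurements each.
d <- data.frame(
  patient = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 3, 3, 3)
)

# Split by patient, fit the same linear model to each piece,
# and keep the fitted models in a list (data frame in, list out).
fits <- dlply(d, "patient", function(piece) lm(y ~ x, data = piece))

# Split by patient, compute a summary statistic for each piece,
# and combine the results into one data frame.
means <- ddply(d, "patient", function(piece) data.frame(mean_y = mean(piece$y)))

# Group-wise transformation: centre y within each patient.
centred <- ddply(d, "patient", transform, y_centred = y - mean(y))
```

The first letter of each function name gives the input type and the second the output type, so dlply takes a data frame and returns a list, while ddply returns a data frame.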
It's already possible to do this with base R functions (like split and
the apply family of functions), but plyr makes it all a bit easier
with:
* totally consistent names, arguments and outputs
* convenient parallelisation through the foreach package
* input from and output to data.frames, matrices and lists
* progress bars to keep track of long-running operations
* built-in error recovery, and informative error messages
* labels that are maintained across all transformations
Considerable effort has been put into making plyr fast and memory
efficient, and in many cases plyr is as fast as, or faster than, the
built-in functions.
You can find out more at http://had.co.nz/plyr/, including a 20-page
introductory guide, http://had.co.nz/plyr/plyr-intro.pdf. You can ask
questions about plyr (and data manipulation in general) on the plyr
mailing list. Sign up at http://groups.google.com/group/manipulatr.
Version 1.2 (2010-09-09)
------------------------------------------------------------------------------
NEW FEATURES
* l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE,
applies functions in parallel using a parallel backend registered with the
foreach package:

    library(plyr)
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    #   user  system elapsed
    #  0.007   0.005   2.005
    library(doMC)
    registerDoMC(2)
    system.time(llply(x, wait, .parallel = TRUE))
    #   user  system elapsed
    #  0.020   0.011   1.038

This work has been generously supported by BD (Becton Dickinson).
MINOR CHANGES
* a*ply and m*ply gain an .expand argument that controls whether data frames
produce a single output dimension (one element for each row), or an output
dimension for each variable.
* new vaggregate (vector aggregate) function, which is equivalent to tapply,
but much faster (~ 10x), since it avoids copying the data.
* llply: for simple lists and vectors, with no progress bar, no extra info,
and no parallelisation, llply calls lapply directly to avoid all the
overhead associated with those unused extra features.
* llply: in the serial case, the for loop is replaced with a custom C function
that takes about 40% less time (or about 20% less time than lapply). Note that,
as a whole, llply still has much more overhead than lapply.
* round_any now lives in plyr instead of reshape
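A brief sketch of two of the functions mentioned above, vaggregate and round_any, may help. The vectors value and group are invented for illustration, and the comparison with tapply checks only that the results agree, not the speed claim.

```r
library(plyr)

# Hypothetical toy data: six values in three groups.
value <- c(1, 2, 3, 4, 5, 6)
group <- factor(c("a", "a", "b", "b", "b", "c"))

# vaggregate(.value, .group, .fun) returns one result per group level.
sums_v <- vaggregate(value, group, sum)

# The equivalent base R call.
sums_t <- tapply(value, group, sum)

# round_any(x, accuracy, f = round) rounds x to a multiple of accuracy.
r1 <- round_any(137, 10)          # nearest multiple of 10
r2 <- round_any(137, 100, floor)  # round down to a multiple of 100
```

Here sums_v and sums_t hold the same per-group sums (3, 12 and 6), r1 is 140 and r2 is 100.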
BUG FIXES
* list_to_array works correctly even when there are missing values in the array.
This is particularly important for daply.
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/