[R] collapsing a data frame

Fri Oct 12 20:20:29 CEST 2007

Here is one way.  Not sure what you wanted done with some of the other
variables, so I just chose the first one; you could do max/min:

> z <- by(h, h$BROOD, function(x){
+     # take first value of elements you don't want to change
+     data.frame(BROOD=x$BROOD[1], TICKS.mean=mean(x$TICKS),
TICKS.sd=sd(x$TICKS),
+         HEIGHT=x$HEIGHT[1], YEAR=x$YEAR[1], LOCATION=x$LOCATION[1])
+ })
> do.call('rbind', z)
    BROOD TICKS.mean TICKS.sd HEIGHT YEAR LOCATION
501   501          0 0.000000    465   95       32
502   502          0       NA    472   95       36
503   503          1 1.732051    475   95       37

On 10/12/07, Ben Bolker <bolker at ufl.edu> wrote:
>
>   Trying to find a quick/slick/easily interpretable way to
> collapse a data set.
>
>  Suppose I have a data set that looks like this:
>
> h <- structure(list(INDEX = structure(1:6, .Label = c("1", "2", "3",
> "4", "5", "6"), class = "factor"), TICKS = c(0, 0, 0, 0, 0, 3
> ), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501",
> "502", "503"), class = "factor"), HEIGHT = c(465, 465, 472, 475,
> 475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("95",
> "96", "97"), class = "factor"), LOCATION = structure(c(1L, 1L,
> 2L, 3L, 3L, 3L), .Label = c("32", "36", "37"), class = "factor")), .Names =
> c("INDEX",
> "TICKS", "BROOD", "HEIGHT", "YEAR", "LOCATION"), row.names = c(NA,
> 6L), class = "data.frame")
>
> i.e.,
> > h
>  INDEX TICKS BROOD HEIGHT YEAR LOCATION
> 1     1     0   501    465   95       32
> 2     2     0   501    465   95       32
> 3     3     0   502    472   95       36
> 4     4     0   503    475   95       37
> 5     5     0   503    475   95       37
> 6     6     3   503    475   95       37
>
> I want a data set that looks like this:
>  BROOD TICKS.mean HEIGHT YEAR LOCATION
>    501          0               465      95      32
>    502          0               472      95      36
>    503          1               475      95      37
>
> (for example).  I.e.,  I want to collapse it to a dataset by brood,
> taking the mean of TICKS and reducing each of
> the other variables (would be nice to allow multiple summary
> statistics, e.g. TICKS.mean and TICKS.sd ...)
> In some ways, this is the opposite of a database join/merge
> operation -- I want to collapse the data frame back down.
> If I had the "unmerged" (i.e., the brood table) handy I could
> use it.
>
>  I know I can construct this table a bit at a time,
>  using tapply() or by()  or aggregate() to get the means.
>
>  Here's a solution that takes the first element of each factor
> and the mean of each numeric variable.  I can imagine there
> are more general/flexible solutions.  (One might want to
> specify more than one summary function, or specify that
> factors that vary within group should be dropped.)
>
> vtype = sapply(h,class)  ## variable types [numeric or factor]
> vtypes = unique(vtype)   ## possible types
> v2 = lapply(vtypes,function(z) which(vtype==z))  ## which are which?
> cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply
> m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) },
>  v2,cfuns,SIMPLIFY=FALSE)
> data.frame(m[[1]],m[[2]][-1])
>
>  My question is whether this is re-inventing the wheel.  Is there
> some function or package that performs this task?
>
>  cheers
>    Ben Bolker
>
> --
> View this message in context: http://www.nabble.com/collapsing-a-data-frame-tf4614195.html#a13177053
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?