[R] Comparison of aggregate in R and group by in mysql

Sun Jan 27 03:34:48 CET 2008

Here is a test I did with a data frame that had only  100 rows, but
the factors has 7000 and 4000 levels.  I had to limit it to this
because with 7000 and 10000 (which was your data size) I ran out of
physical memory on my system (1GB).  You can see the amount of memory
that it was using which is dependent on the products of the levels
being used, and definitely not the size of the data (100 rows).  So
the real problem is the number of levels you have in your factors and
the way that R does aggregate (it calls 'tapply' internally).  So you
probably need to run with a database to handle data with this many
levels.

> n <- 100
> x <- data.frame(A=factor(sample(1:10,n,TRUE), levels=1:7000),
+     B=factor(sample(1:20,n,TRUE), levels=1:4000),
+
+     data=runif(n))
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 150963  4.1     350000  9.4   350000  9.4
Vcells 104706  0.8     786432  6.0   479077  3.7
> system.time(x.ta <- tapply(x$data, list(x$A, x$B), mean))
   user  system elapsed
   4.17    0.56    5.50
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   151374   4.1     350000   9.4   350000   9.4
Vcells 28110297 214.5   46770568 356.9 42128199 321.5
>

2008/1/26 zhihuali <lzhtom at hotmail.com>:
>
> thanks, Jim.
>
> My system has a RAM of 1 GB. I guess the computed matrix is close to the limit of the memory and that's what caused the problem.  I think I'll take Wensui's suggestion and use a relational database system to handle the huge data.
>
>
>
> > Date: Sat, 26 Jan 2008 20:40:51 -0500
>
> > From: jholtman at gmail.com
> > To: lzhtom at hotmail.com
> > Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> > CC: r-help at stat.math.ethz.ch
> >
> > I think with your data you will be computing a matrix that is 7049 x
> > 11704.  This will require about 700MB of memory.  What size system do
> > you have (how much memory)?  How big is the dataframe? (do 'str' and
> > report what it says).  This will require a lot more resources and
> > given that you have about 80M possible combinations, I would assume
> > that a lot of them are probably empty.  It is having to 'split' the
> > data into the groups and then summarize.  Maybe you should use a
> > database with this combination of data.
> >
> > 2008/1/26 zhihuali <lzhtom at hotmail.com>:
> > >
> > > I repeated your experiment:
> > > > n <- 1000000
> > > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> > >   user  system elapsed
> > >  1.824   0.212   2.038
> > >
> > >
> > > Now I use my own data:
> > > > length(levels(group))
> > > [1] 7049
> > > > length(levels(type))
> > > [1] 11704
> > > > y<-data.frame(group,type,signal)
> > > > system.time(y.agg <- aggregate(y$signal, list(y$group,y$type), mean))
> > >   (I killed it after 30 minutes)
> > >
> > >
> > >
> > > > Date: Sat, 26 Jan 2008 19:55:51 -0500
> > > > From: jholtman at gmail.com
> > > > To: lzhtom at hotmail.com
> > > > Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> > > > CC: r-help at stat.math.ethz.ch
> > >
> > > >
> > > > How large is your dataframe?  How much memory do you have on your
> > > > system?  Are you paging?  Here is a test I ran with a data frame with
> > > > 1,000,000 entries and it seems to be fast:
> > > >
> > > > > n <- 1000000
> > > > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),
> > > > +     C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> > > >    user  system elapsed
> > > >    2.65    0.34    3.00
> > > > >
> > > >
> > > > On Jan 26, 2008 6:45 PM, zhihuali <lzhtom at hotmail.com> wrote:
> > > > >
> > > > > Hi, netters,
> > > > >
> > > > > First of all, thanks a lot for all the prompt replies to my earlier question about "merging" data frames in R.
> > > > > Actually that's an equivalence to the "join" clause in mysql.
> > > > >
> > > > > Now I have another question. Suppose I have a data frame X with lots of columns/variables:
> > > > > Name, Age,Group, Type, Salary.
> > > > > I wanna do a subtotal of salaries:
> > > > > aggregate(X$Salary, by=list(X$Group,X$Age,X$Type),Fun=mean)
> > > > >
> > > > > When the levels of Group and Type are huge, it took R forever to finish the aggregation.
> > > > > And I used gc to find that the memory usage was big too.
> > > > >
> > > > > However, in mysql, it took seconds to finish a similar job:
> > > > > select Group,Age,Type ,avg(Salary)  from X group by  Group,Age,Type
> > > > >
> > > > > Is it because mysql is superior in doing such kind of things? Or my R command is not efficient enough? Why did R have to consume huge memories to do the aggregation?
> > > > >
> > > > > Thanks again!
> > > > >
> > > > > Zhihua Li
> > > > >
> > > > > _________________________________________________________________
> > > > > 天凉了，添衣了，心动了，"七件"了
> > > > > http://get.live.cn
> > > > >        [[alternative HTML version deleted]]
> > > > >
> > > > >
> > > > > ______________________________________________
> > > > > R-help at r-project.org mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jim Holtman
> > > > Cincinnati, OH
> > > > +1 513 646 9390
> > > >
> > > > What is the problem you are trying to solve?
> > >
> > > _________________________________________________________________
> > > MSN圣诞礼物火热登场，免费发放中，快来领取吧！
> > > http://im.live.cn/emoticons/?ID=18
> > >        [[alternative HTML version deleted]]
> > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> > >
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem you are trying to solve?
>
> _________________________________________________________________
> 天凉了，添衣了，心动了，"七件"了
> http://get.live.cn
>        [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?