[R] Comparison of aggregate in R and group by in mysql

jim holtman jholtman at gmail.com
Sun Jan 27 02:40:51 CET 2008


I think with your data you will be computing a 7049 x 11704 matrix, which
by itself would need about 700MB of memory.  What size system do you have
(how much memory)?  How big is the data frame?  (Run 'str' on it and report
what it says.)  This will require a lot more resources, and given that there
are about 80M possible combinations, I would assume that most of them are
empty.  aggregate has to 'split' the data into all of those groups and then
summarize each one.  Maybe you should use a database for this combination
of data.
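
Another option, staying in R, is to collapse the two grouping factors into a
single key before splitting, so that only the (group, type) combinations that
actually occur in the data are formed.  This is just a rough sketch, assuming
the 'y', 'group', 'type' and 'signal' objects from your session:

    ## one factor level per observed (group, type) pair;
    ## drop = TRUE keeps only the combinations that really occur
    key <- interaction(y$group, y$type, drop = TRUE)
    nlevels(key)                 # how many combinations actually occur
    ## mean signal per observed combination -- no 7049 x 11704 matrix
    y.means <- tapply(y$signal, key, mean)

If most of the ~80M possible cells are empty, this should be far lighter on
memory than splitting over every possible combination.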

2008/1/26 zhihuali <lzhtom at hotmail.com>:
>
> I repeated your experiment:
> > n <- 1000000
> > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
>   user  system elapsed
>  1.824   0.212   2.038
>
>
> Now I use my own data:
> > length(levels(group))
> [1] 7049
> > length(levels(type))
> [1] 11704
> > y<-data.frame(group,type,signal)
> > system.time(y.agg <- aggregate(y$signal, list(y$group,y$type), mean))
>   (I killed it after 30 minutes)
>
>
>
> > Date: Sat, 26 Jan 2008 19:55:51 -0500
> > From: jholtman at gmail.com
> > To: lzhtom at hotmail.com
> > Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> > CC: r-help at stat.math.ethz.ch
>
> >
> > How large is your data frame?  How much memory do you have on your
> > system?  Are you paging?  Here is a test I ran with a data frame of
> > 1,000,000 rows, and it seems to be fast:
> >
> > > n <- 1000000
> > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),
> > +     C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> >    user  system elapsed
> >    2.65    0.34    3.00
> > >
> >
> > On Jan 26, 2008 6:45 PM, zhihuali <lzhtom at hotmail.com> wrote:
> > >
> > > Hi, netters,
> > >
> > > First of all, thanks a lot for all the prompt replies to my earlier question about "merging" data frames in R.
> > > That is essentially the equivalent of the "join" clause in MySQL.
> > >
> > > Now I have another question.  Suppose I have a data frame X with lots of columns/variables:
> > > Name, Age, Group, Type, Salary.
> > > I want to compute a per-group summary of salaries:
> > > aggregate(X$Salary, by=list(X$Group, X$Age, X$Type), FUN=mean)
> > >
> > > When Group and Type have a huge number of levels, it takes R forever to finish the aggregation,
> > > and gc() shows that the memory usage is very high as well.
> > >
> > > However, a similar job takes only seconds in MySQL:
> > > select Group, Age, Type, avg(Salary) from X group by Group, Age, Type
> > >
> > > Is it because MySQL is simply better at this kind of operation, or is my R command not efficient enough?  Why does R need so much memory to do the aggregation?
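
Since MySQL handles the grouping quickly, one option is to let the database
do the work and pull only the summarized table into R.  A rough sketch with
the RMySQL package (the connection settings below are placeholders, not your
real values):

    library(RMySQL)
    ## placeholder connection details -- substitute your own
    con <- dbConnect(MySQL(), dbname = "mydb", user = "me", password = "secret")
    ## backticks around Group because GROUP is a reserved word in MySQL
    sal <- dbGetQuery(con, "select `Group`, Age, Type, avg(Salary)
                              from X group by `Group`, Age, Type")
    dbDisconnect(con)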
> > >
> > > Thanks again!
> > >
> > > Zhihua Li
> > >
> > >
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

