[R] Select rows based on matching conditions and logical operators
William Dunlap
wdunlap at tibco.com
Thu Jul 26 00:37:27 CEST 2012
Any of those would work. I wish ave() did that part of the job.
I don't think there is any reason it shouldn't. The following needs to
call FUN only three times, not nine (once per level of the 3 x 3 interaction):
> z <- ave(LETTERS[1:3], 1:3, 1:3, FUN=function(x)print(x))
[1] "A"
character(0)
character(0)
character(0)
[1] "B"
character(0)
character(0)
character(0)
[1] "C"
> z
[1] "A" "B" "C"
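
A minimal sketch of the same point, using only base R. The names g_all, g_used and z2 are made up for this illustration. It compares the number of groups produced by the full interaction with the number left after dropping the unused levels, and checks that the dropped version gives ave() the same result.

```r
x <- LETTERS[1:3]

# Full interaction of two 3-level groupings: 9 levels, only 3 of them occupied.
g_all <- interaction(1:3, 1:3)
# The same grouping with the empty levels dropped.
g_used <- droplevels(g_all)

nlevels(g_all)             # 9
nlevels(g_used)            # 3
length(split(x, g_all))    # 9 pieces, 6 of them empty
length(split(x, g_used))   # 3 pieces, none empty

# Handing the already-dropped factor to ave() leaves the result unchanged.
z2 <- ave(x, g_used, FUN = function(v) v)
identical(z2, x)           # TRUE
```

Each piece returned by split() is one call to FUN, so dropping the levels first is what cuts the work from nine calls to three.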
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Wednesday, July 25, 2012 3:04 PM
> To: Rui Barradas
> Cc: William Dunlap; r-help
> Subject: Re: [R] Select rows based on matching conditions and logical operators
>
> Wouldn't
>
> > interaction(..., drop=TRUE)
>
> be the same, but terser in this situation?
>
> Also I tend to use paste() for this, i.e. instead of
>
> > interaction(v1,v2, drop=TRUE)
>
> simply
>
> > paste(v1,v2)
>
> Again, this seems shorter and simpler -- but are there good reasons to
> prefer the use of interaction()?
>
> Cheers,
> Bert
>
> On Wed, Jul 25, 2012 at 2:51 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > Hello,
> >
> > You're right, thanks.
> > In my solution, I had tried to stay as close to the OP's code as possible. A
> > glance at it made me realize that a single change would do the job, and that
> > was it; I did not consider performance.
> > I particularly liked the interaction/droplevels trick.
> >
> > Rui Barradas
> >
> > Em 25-07-2012 22:13, William Dunlap escreveu:
> >>
> >> Rui,
> >> Your solution works, but for large data.frames it can be faster to compute
> >> the indices of the desired rows of the input data.frame and then use one
> >> subscripting call to select the rows, instead of splitting the input
> >> data.frame into a list of data.frames, extracting the desired row from each
> >> component, and then calling rbind to put the rows together again. E.g.,
> >> compare your approach, which I've put into the function f1,
> >> f1 <- function (dataFrame) {
> >>     retval <- with(dataFrame, sapply(split(dataFrame, list(PTID, Year)),
> >>         function(x) if (nrow(x)) x[which.max(x$Count), ]))
> >>     retval <- do.call(rbind, retval)
> >>     rownames(retval) <- 1:nrow(retval)
> >>     retval
> >> }
> >> with one that computes a logical subscripting vector (by splitting just the
> >> Count vector, not the whole data.frame):
> >> f2 <- function (dataFrame) {
> >>     keep <- as.logical(ave(dataFrame$Count,
> >>         droplevels(interaction(dataFrame$PTID, dataFrame$Year)),
> >>         FUN = function(x) if (length(x)) seq_along(x) == which.max(x)))
> >>     dataFrame[keep, ]
> >> }
> >>
> >> They both compute the same thing, except that the rows come out in a
> >> different order (f2 keeps the order of the original data.frame) and f2
> >> leaves the original row label with each row.
> >> > f1(df1)
> >>   PGID  PTID Year Visit Count
> >> 1 6755 53122 2008     3     1
> >> 2 6755 53121 2009     1     0
> >> 3 6755 53122 2009     3     2
> >> > f2(df1)
> >>   PGID  PTID Year Visit Count
> >> 1 6755 53121 2009     1     0
> >> 6 6755 53122 2008     3     1
> >> 9 6755 53122 2009     3     2
> >> When there are a lot of output rows, f2 can be quite a bit faster.
> >>
> >> (I put the call to droplevels(interaction(...)) into the call to ave because
> >> ave can waste a lot of time calling FUN for nonexistent interaction levels.)
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Rui Barradas
> >>> Sent: Wednesday, July 25, 2012 10:24 AM
> >>> To: kborgmann
> >>> Cc: r-help
> >>> Subject: Re: [R] Select rows based on matching conditions and logical operators
> >>>
> >>> Hello,
> >>>
> >>> Apart from the output order, this does it.
> >>> (I have changed 'df' to 'df1' because 'df' is an R function, the density of
> >>> the F distribution.)
> >>>
> >>>
> >>> df1 <- read.table(text="
> >>> PGID PTID Year Visit Count
> >>> 6755 53121 2009 1 0
> >>> 6755 53121 2009 2 0
> >>> 6755 53121 2009 3 0
> >>> 6755 53122 2008 1 0
> >>> 6755 53122 2008 2 0
> >>> 6755 53122 2008 3 1
> >>> 6755 53122 2009 1 0
> >>> 6755 53122 2009 2 1
> >>> 6755 53122 2009 3 2", header=TRUE)
> >>>
> >>>
> >>> df2 <- with(df1, sapply(split(df1, list(PTID, Year)),
> >>>     function(x) if (nrow(x)) x[which.max(x$Count), ]))
> >>> df2 <- do.call(rbind, df2)
> >>> rownames(df2) <- 1:nrow(df2)
> >>> df2
> >>>
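> >>> Note the use of which.max(), not which(): which.max() returns only the
> >>> first maximum, so tied counts give a single row.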
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>> Em 25-07-2012 18:10, kborgmann escreveu:
> >>>>
> >>>> Hi,
> >>>> I have a dataset in which I would like to select rows based on matching
> >>>> conditions and return the maximum value of a variable, or else return one
> >>>> row if duplicate counts exist. My dataset looks like this:
> >>>> PGID PTID Year Visit Count
> >>>> 6755 53121 2009 1 0
> >>>> 6755 53121 2009 2 0
> >>>> 6755 53121 2009 3 0
> >>>> 6755 53122 2008 1 0
> >>>> 6755 53122 2008 2 0
> >>>> 6755 53122 2008 3 1
> >>>> 6755 53122 2009 1 0
> >>>> 6755 53122 2009 2 1
> >>>> 6755 53122 2009 3 2
> >>>>
> >>>> I would like to select rows if PTID and Year match and return the maximum
> >>>> count, or else return one row if counts are the same, such that I get this
> >>>> output:
> >>>> PGID PTID Year Visit Count
> >>>> 6755 53121 2009 1 0
> >>>> 6755 53122 2008 3 1
> >>>> 6755 53122 2009 3 2
> >>>>
> >>>> I tried the following code and the output is almost correct, but duplicate
> >>>> values were included:
> >>>> df2<-with(df, sapply(split(df, list(PTID, Year)),
> >>>> function(x) if (nrow(x)) x[which(x$Count==max(x$Count)),]))
> >>>> df2<-do.call(rbind,df2)
> >>>> rownames(df2)<-1:nrow(df2)
> >>>>
> >>>> Any suggestions?
> >>>> Thanks much for your responses!
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm