[R] Select rows based on matching conditions and logical operators

William Dunlap wdunlap at tibco.com
Thu Jul 26 00:37:27 CEST 2012


Any of those would work.  I wish ave() did that part of the job itself;
I don't think there is any reason it shouldn't.  The following should
need to call FUN only three times, not nine:
   > z <- ave(LETTERS[1:3], 1:3, 1:3, FUN=function(x)print(x))
   [1] "A"
   character(0)
   character(0)
   character(0)
   [1] "B"
   character(0)
   character(0)
   character(0)
   [1] "C"
   > z
   [1] "A" "B" "C"
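
The level counts behind that output can be checked without relying on ave()'s internals. This is a minimal sketch, assuming the same toy vectors as above; the names g1, g2, all_levels, used_levels and n_calls are illustrative only and not from the original post. It shows that crossing the two grouping vectors gives 9 levels, only 3 of which occur, and that handing ave() a single factor with the unused levels already dropped means FUN sees just the non-empty groups.

```r
# Illustrative names; not from the original post.
x  <- LETTERS[1:3]
g1 <- 1:3
g2 <- 1:3

# Every combination of levels, including the 6 that never occur in the data.
all_levels  <- interaction(g1, g2)
# Only the combinations actually present.
used_levels <- interaction(g1, g2, drop = TRUE)

nlevels(all_levels)
nlevels(used_levels)

# split() tells the same story: empty groups appear unless drop = TRUE.
n_groups_all  <- length(split(x, list(g1, g2)))
n_groups_used <- length(split(x, list(g1, g2), drop = TRUE))

# With one pre-dropped grouping factor, FUN is called once per non-empty group.
n_calls <- 0
z <- ave(x, used_levels, FUN = function(v) { n_calls <<- n_calls + 1; v })
n_calls
```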

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Wednesday, July 25, 2012 3:04 PM
> To: Rui Barradas
> Cc: William Dunlap; r-help
> Subject: Re: [R] Select rows based on matching conditions and logical operators
> 
> Wouldn't
> 
> > interaction(..., drop=TRUE)
> 
> be the same, but terser in this situation?
> 
> Also I tend to use paste() for this, i.e. instead of
> 
> > interaction(v1,v2, drop=TRUE)
> 
> simply
> 
> > paste(v1,v2)
> 
> Again, this seems shorter and simpler -- but are there good reasons to
> prefer the use of interaction()?
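
One possible difference, offered as a sketch rather than a definitive answer to the question above: paste() returns a character key built with a separator, so two distinct pairs can collapse to the same key when a value happens to contain that separator. The values below are hypothetical and chosen only to make the default separator " " ambiguous; interaction() has its own separator (".") and is not claimed to be immune in general.

```r
# Hypothetical values; the default paste() separator " " is ambiguous here.
v1 <- c("a b", "a")
v2 <- c("c",   "b c")

key_paste <- paste(v1, v2)                      # character vector
key_inter <- interaction(v1, v2, drop = TRUE)   # factor

# Number of distinct groups each key would produce for these two rows.
n_paste <- length(unique(key_paste))
n_inter <- nlevels(key_inter)
n_paste
n_inter
```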
> 
> Cheers,
> Bert
> 
> On Wed, Jul 25, 2012 at 2:51 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > Hello,
> >
> > You're right, thanks.
> > In my solution, I had tried to keep to the OP's code as much as possible.
> > A glance at it made me realize that a single change would do the job, and
> > that was it, no performance worries.
> > I particularly liked the interaction/droplevels trick.
> >
> > Rui Barradas
> >
> > Em 25-07-2012 22:13, William Dunlap escreveu:
> >>
> >> Rui,
> >>    Your solution works, but for large data.frames it can be faster to
> >> compute the indices of the desired rows of the input data.frame and then
> >> use one subscripting call to select the rows, instead of splitting the
> >> input data.frame into a list of data.frames, extracting the desired row
> >> from each component, and then calling rbind to put the rows together
> >> again.  E.g., compare your approach, which I've put into the function f1
> >>    f1 <- function (dataFrame)  {
> >>        retval <- with(dataFrame, sapply(split(dataFrame, list(PTID,
> >>            Year)), function(x) if (nrow(x))
> >>            x[which.max(x$Count), ]))
> >>        retval <- do.call(rbind, retval)
> >>        rownames(retval) <- 1:nrow(retval)
> >>        retval
> >>    }
> >> with one that computes a logical subscripting vector (by splitting just
> >> the Counts vector, not the whole data.frame)
> >>    f2 <- function (dataFrame)  {
> >>        keep <- as.logical(ave(dataFrame$Count,
> >>            droplevels(interaction(dataFrame$PTID, dataFrame$Year)),
> >>            FUN = function(x) if (length(x)) seq_along(x) == which.max(x)))
> >>        dataFrame[keep, ]
> >>    }
> >>
> >> They both compute the same thing, aside from the fact that the rows
> >> are in a different order (f2 keeps the order of the original data.frame)
> >> and f2 leaves the original row label with the row.
> >>    > f1(df1)
> >>      PGID  PTID Year Visit Count
> >>    1 6755 53122 2008     3     1
> >>    2 6755 53121 2009     1     0
> >>    3 6755 53122 2009     3     2
> >>    > f2(df1)
> >>      PGID  PTID Year Visit Count
> >>    1 6755 53121 2009     1     0
> >>    6 6755 53122 2008     3     1
> >>    9 6755 53122 2009     3     2
> >> When there are a lot of output rows, f2 can be quite a bit faster.
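
The f2 approach can be restated as a self-contained script. This is a sketch under the assumption that the data are the df1 example quoted further down the thread; the intermediate names grp, is_first_max and keep_rows are illustrative and not from the original post.

```r
df1 <- read.table(text = "
PGID PTID Year Visit Count
6755 53121 2009 1 0
6755 53121 2009 2 0
6755 53121 2009 3 0
6755 53122 2008 1 0
6755 53122 2008 2 0
6755 53122 2008 3 1
6755 53122 2009 1 0
6755 53122 2009 2 1
6755 53122 2009 3 2", header = TRUE)

# One grouping factor holding only the PTID/Year pairs that occur.
grp <- interaction(df1$PTID, df1$Year, drop = TRUE)

# Within each group, flag the position of the first maximum Count.
# ave() stores the flags back into the integer Count vector as 0/1.
is_first_max <- ave(df1$Count, grp,
                    FUN = function(x) seq_along(x) == which.max(x))
keep_rows <- as.logical(is_first_max)

# A single subscripting call; original row order and row labels are kept.
result <- df1[keep_rows, ]
result
```

Only the Count vector is split here, so the cost of building and re-binding many small data.frames is avoided.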
> >>
> >> (I put the call to droplevels(interaction(...)) into the call to ave
> >> because ave can waste a lot of time calling FUN for nonexistent
> >> interaction levels.)
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Rui Barradas
> >>> Sent: Wednesday, July 25, 2012 10:24 AM
> >>> To: kborgmann
> >>> Cc: r-help
> >>> Subject: Re: [R] Select rows based on matching conditions and logical operators
> >>>
> >>> Hello,
> >>>
> >>> Apart from the output order this does it.
> >>> (I have changed 'df' to 'df1', 'df' is an R function, the F distribution
> >>> density.)
> >>>
> >>>
> >>> df1 <- read.table(text="
> >>> PGID PTID Year Visit  Count
> >>> 6755 53121 2009 1 0
> >>> 6755 53121 2009 2 0
> >>> 6755 53121 2009 3 0
> >>> 6755 53122 2008 1 0
> >>> 6755 53122 2008 2 0
> >>> 6755 53122 2008 3 1
> >>> 6755 53122 2009 1 0
> >>> 6755 53122 2009 2 1
> >>> 6755 53122 2009 3 2", header=TRUE)
> >>>
> >>>
> >>> df2 <- with(df1, sapply(split(df1, list(PTID, Year)),
> >>>       function(x) if (nrow(x)) x[which.max(x$Count), ]))
> >>> df2 <- do.call(rbind, df2)
> >>> rownames(df2) <- 1:nrow(df2)
> >>> df2
> >>>
> >>> Note the use of which.max(), not which().
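
The difference between the two calls shows up when counts are tied. A minimal sketch with a hypothetical all-zero group like the first one in the data (the names counts, all_ties and first_only are illustrative):

```r
# Hypothetical group in which every Count is tied at the maximum.
counts <- c(0, 0, 0)

# which() on an equality test returns every tied position ...
all_ties <- which(counts == max(counts))
# ... while which.max() returns only the first one.
first_only <- which.max(counts)

all_ties
first_only
```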
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>> Em 25-07-2012 18:10, kborgmann escreveu:
> >>>>
> >>>> Hi,
> >>>> I have a dataset in which I would like to select rows based on matching
> >>>> conditions and return the maximum value of a variable else return one
> >>>> row if duplicate counts exist.  My dataset looks like this:
> >>>> PGID    PTID    Year     Visit  Count
> >>>> 6755    53121   2009    1       0
> >>>> 6755    53121   2009    2       0
> >>>> 6755    53121   2009    3       0
> >>>> 6755    53122   2008    1       0
> >>>> 6755    53122   2008    2       0
> >>>> 6755    53122   2008    3       1
> >>>> 6755    53122   2009    1       0
> >>>> 6755    53122   2009    2       1
> >>>> 6755    53122   2009    3       2
> >>>>
> >>>> I would like to select rows if PTID and Year match and return the
> >>>> maximum count else return one row if counts are the same, such that I
> >>>> get this output
> >>>> PGID    PTID    Year     Visit  Count
> >>>> 6755    53121   2009    1       0
> >>>> 6755    53122   2008    3       1
> >>>> 6755    53122   2009    3       2
> >>>>
> >>>> I tried the following code and the output is almost correct but
> >>>> duplicate values were included
> >>>> df2<-with(df, sapply(split(df, list(PTID, Year)),
> >>>> function(x) if (nrow(x)) x[which(x$Count==max(x$Count)),]))
> >>>> df2<-do.call(rbind,df2)
> >>>> rownames(df2)<-1:nrow(df2)
> >>>>
> >>>> Any suggestions?
> >>>> Thanks much for your responses!
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> View this message in context:
> >>>> http://r.789695.n4.nabble.com/Select-rows-based-on-matching-conditions-and-logical-operators-tp4637809.html
> >>>>
> >>>> Sent from the R help mailing list archive at Nabble.com.
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >
> >
> 
> 
> 
> --
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> 
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

