[R] Select rows based on matching conditions and logical operators

Rui Barradas ruipbarradas at sapo.pt
Wed Jul 25 23:51:17 CEST 2012


Hello,

You're right, thanks.
In my solution, I had tried to stay as close as possible to the original 
poster's code. A glance at it made me realize that a single change would 
do the job, and that was it; I had no performance worries.
I particularly liked the interaction/droplevels trick.
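
For anyone reading along, here is a minimal sketch of both points, run on the df1 data from the quoted messages: the single change (which.max() instead of which()) and what droplevels(interaction(...)) buys you. The names x, g_all, g_used, sizes and keep are mine, for illustration only. The FUN passed to ave() is a simplified form of the one in Bill's f2, and it is safe here only because droplevels() has already removed the empty group.

```r
# Data from the thread.
df1 <- read.table(text = "
PGID PTID Year Visit Count
6755 53121 2009 1 0
6755 53121 2009 2 0
6755 53121 2009 3 0
6755 53122 2008 1 0
6755 53122 2008 2 0
6755 53122 2008 3 1
6755 53122 2009 1 0
6755 53122 2009 2 1
6755 53122 2009 3 2", header = TRUE)

# The single change: which() keeps every tied maximum, which.max() only the first.
x <- c(0, 0, 0)
which(x == max(x))  # 1 2 3
which.max(x)        # 1

# interaction() creates a level for every PTID x Year combination,
# including 53121.2008, which never occurs in the data.
g_all  <- interaction(df1$PTID, df1$Year)
g_used <- droplevels(g_all)
nlevels(g_all)   # 4
nlevels(g_used)  # 3

# Splitting on the undropped factor yields one empty group.
sizes <- sapply(split(df1$Count, g_all), length)
sizes

# Flag the first maximum of Count within each non-empty group, then
# select all rows with one subscripting call.
keep <- as.logical(ave(df1$Count, g_used,
    FUN = function(x) seq_along(x) == which.max(x)))
df1[keep, ]
```

The selected rows keep their original row labels (1, 6 and 9), as in the f2 output quoted below.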

Rui Barradas

Em 25-07-2012 22:13, William Dunlap escreveu:
> Rui,
>    Your solution works, but it can be faster for large data.frames if you compute
> the indices of the desired rows of the input data.frame and then use one
> subscripting call to select the rows instead of splitting the input data.frame
> into a list of data.frames, extracting the desired row from each component,
> and then calling rbind to put the rows together again.  E.g., compare your
> approach, which I've put into the function f1
>    f1 <- function (dataFrame)  {
>        # lapply always returns a list; sapply could simplify it when no group is empty
>        retval <- with(dataFrame, lapply(split(dataFrame, list(PTID,
>            Year)), function(x) if (nrow(x))
>            x[which.max(x$Count), ]))
>        retval <- do.call(rbind, retval)
>        rownames(retval) <- seq_len(nrow(retval))
>        retval
>    }
> with one that computes a logical subscripting vector (by splitting just the
> Counts vector, not the whole data.frame)
>    f2 <- function (dataFrame)  {
>        keep <- as.logical(ave(dataFrame$Count, droplevels(interaction(dataFrame$PTID,
>            dataFrame$Year)), FUN = function(x) if (length(x)) seq_along(x) ==
>            which.max(x)))
>        dataFrame[keep, ]
>    }
>
> They both compute the same thing, aside from the fact that the rows
> are in a different order (f2 keeps the order of the original data.frame)
> and f2 leaves the original row label with each row.
>> f1(df1)
>    PGID  PTID Year Visit Count
> 1 6755 53122 2008     3     1
> 2 6755 53121 2009     1     0
> 3 6755 53122 2009     3     2
>> f2(df1)
>    PGID  PTID Year Visit Count
> 1 6755 53121 2009     1     0
> 6 6755 53122 2008     3     1
> 9 6755 53122 2009     3     2
> When there are a lot of output rows, f2 can be quite a bit faster.
>
> (I put the call to droplevels(interaction(...)) into the call to ave because ave
> can waste a lot of time calling FUN for nonexistent interaction levels.)
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>> Behalf Of Rui Barradas
>> Sent: Wednesday, July 25, 2012 10:24 AM
>> To: kborgmann
>> Cc: r-help
>> Subject: Re: [R] Select rows based on matching conditions and logical operators
>>
>> Hello,
>>
>> Apart from the output order this does it.
>> (I have changed 'df' to 'df1', 'df' is an R function, the F distribution
>> density.)
>>
>>
>> df1 <- read.table(text="
>> PGID PTID Year Visit  Count
>> 6755 53121 2009 1 0
>> 6755 53121 2009 2 0
>> 6755 53121 2009 3 0
>> 6755 53122 2008 1 0
>> 6755 53122 2008 2 0
>> 6755 53122 2008 3 1
>> 6755 53122 2009 1 0
>> 6755 53122 2009 2 1
>> 6755 53122 2009 3 2", header=TRUE)
>>
>>
>> # lapply always returns a list; sapply could simplify it when no group is empty
>> df2 <- with(df1, lapply(split(df1, list(PTID, Year)),
>>       function(x) if (nrow(x)) x[which.max(x$Count), ]))
>> df2 <- do.call(rbind, df2)
>> rownames(df2) <- seq_len(nrow(df2))
>> df2
>>
>> Note: which.max(), not which().
>>
>> Hope this helps,
>>
>> Rui Barradas
>> Em 25-07-2012 18:10, kborgmann escreveu:
>>> Hi,
>>> I have a dataset in which I would like to select rows based on matching
>>> conditions and return the maximum value of a variable else return one row if
>>> duplicate counts exist.  My dataset looks like this:
>>> PGID	PTID	Year	 Visit  Count
>>> 6755	53121	2009	1	0
>>> 6755	53121	2009	2	0
>>> 6755	53121	2009	3	0
>>> 6755	53122	2008	1	0
>>> 6755	53122	2008	2	0
>>> 6755	53122	2008	3	1
>>> 6755	53122	2009	1	0
>>> 6755	53122	2009	2	1
>>> 6755	53122	2009	3	2
>>>
>>> I would like to select rows if PTID and Year match and return the maximum
>>> count else return one row if counts are the same, such that I get this
>>> output
>>> PGID	PTID	Year	 Visit  Count
>>> 6755	53121	2009	1	0
>>> 6755	53122	2008	3	1
>>> 6755	53122	2009	3	2
>>>
>>> I tried the following code and the output is almost correct but duplicate
>>> values were included
>>> df2<-with(df, sapply(split(df, list(PTID, Year)),
>>> function(x) if (nrow(x)) x[which(x$Count==max(x$Count)),]))
>>> df2<-do.call(rbind,df2)
>>> rownames(df2)<-1:nrow(df2)
>>>
>>> Any suggestions?
>>> Thanks much for your responses!
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://r.789695.n4.nabble.com/Select-rows-based-on-matching-conditions-and-logical-operators-tp4637809.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.