[R] Select rows based on matching conditions and logical operators

Bert Gunter gunter.berton at gene.com
Thu Jul 26 00:03:41 CEST 2012


Wouldn't

> interaction(..., drop=TRUE)

be the same as droplevels(interaction(...)), but terser in this situation?

Also I tend to use paste() for this, i.e. instead of

> interaction(v1,v2, drop=TRUE)

simply

> paste(v1,v2)

Again, this seems shorter and simpler -- but are there good reasons to
prefer the use of interaction()?
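
For anyone who wants to check the equivalence, here is a minimal sketch (not a benchmark) that applies Bill's ave() idea with each of the three groupings, using the df1 data from Rui's message quoted below. pick_max is a hypothetical helper name introduced here for illustration only; it is not from the thread.

```r
# df1 as posted by Rui in the quoted message.
df1 <- read.table(text = "
PGID PTID Year Visit Count
6755 53121 2009 1 0
6755 53121 2009 2 0
6755 53121 2009 3 0
6755 53122 2008 1 0
6755 53122 2008 2 0
6755 53122 2008 3 1
6755 53122 2009 1 0
6755 53122 2009 2 1
6755 53122 2009 3 2", header = TRUE)

# Hypothetical helper: keep the first row holding the maximum Count
# within each group defined by 'group' (a factor or a character vector).
pick_max <- function(dataFrame, group) {
  keep <- as.logical(ave(dataFrame$Count, group,
                         FUN = function(x) seq_along(x) == which.max(x)))
  dataFrame[keep, ]
}

g_drop  <- with(df1, droplevels(interaction(PTID, Year)))   # Bill's version
g_terse <- with(df1, interaction(PTID, Year, drop = TRUE))  # terser spelling
g_paste <- with(df1, paste(PTID, Year))                     # character, not a factor

r_drop  <- pick_max(df1, g_drop)
r_terse <- pick_max(df1, g_terse)
r_paste <- pick_max(df1, g_paste)

# Show the selected rows and whether all three groupings agree.
print(r_paste)
print(identical(r_drop, r_terse) && identical(r_drop, r_paste))
```

On this small example the groupings select the same rows. One visible difference is that paste() returns a character vector while interaction() returns a factor. Whether that matters for speed or for values that contain the separator is not something this sketch tests.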

Cheers,
Bert

On Wed, Jul 25, 2012 at 2:51 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> Hello,
>
> You're right, thanks.
> In my solution, I tried to stay as close to the OP's code as possible. A
> glance at it made me realize that a single change would do the job, and that
> was it; I had no performance worries.
> I particularly liked the interaction/droplevels trick.
>
> Rui Barradas
>
> On 25-07-2012 22:13, William Dunlap wrote:
>>
>> Rui,
>>    Your solution works, but it can be faster for large data.frames if you
>> compute the indices of the desired rows of the input data.frame and then use
>> one subscripting call to select the rows, instead of splitting the input
>> data.frame into a list of data.frames, extracting the desired row from each
>> component, and then calling rbind to put the rows together again.  E.g.,
>> compare your approach, which I've put into the function f1
>>    f1 <- function (dataFrame)  {
>>        retval <- with(dataFrame, sapply(split(dataFrame, list(PTID,
>>            Year)), function(x) if (nrow(x))
>>            x[which.max(x$Count), ]))
>>        retval <- do.call(rbind, retval)
>>        rownames(retval) <- 1:nrow(retval)
>>        retval
>>    }
>> with one that computes a logical subscripting vector (by splitting just the
>> Counts vector, not the whole data.frame)
>>    f2 <- function (dataFrame)  {
>>        keep <- as.logical(ave(dataFrame$Count,
>>            droplevels(interaction(dataFrame$PTID, dataFrame$Year)),
>>            FUN = function(x) if (length(x)) seq_along(x) == which.max(x)))
>>        dataFrame[keep, ]
>>    }
>>
>> They both compute the same thing, aside from the fact that the rows
>> are in a different order (f2 keeps the order of the original data.frame)
>> and f2 leaves the original row label with the row.
>>    > f1(df1)
>>    PGID  PTID Year Visit Count
>> 1 6755 53122 2008     3     1
>> 2 6755 53121 2009     1     0
>> 3 6755 53122 2009     3     2
>>    > f2(df1)
>>    PGID  PTID Year Visit Count
>> 1 6755 53121 2009     1     0
>> 6 6755 53122 2008     3     1
>> 9 6755 53122 2009     3     2
>>
>> When there are a lot of output rows, f2 can be quite a bit faster.
>>
>> (I put the call to droplevels(interaction(...)) into the call to ave because
>> ave can waste a lot of time calling FUN for nonexistent interaction levels.)
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On Behalf Of Rui Barradas
>>> Sent: Wednesday, July 25, 2012 10:24 AM
>>> To: kborgmann
>>> Cc: r-help
>>> Subject: Re: [R] Select rows based on matching conditions and logical operators
>>>
>>> Hello,
>>>
>>> Apart from the output order, this does it.
>>> (I have changed 'df' to 'df1' because 'df' is an R function, the density of
>>> the F distribution.)
>>>
>>>
>>> df1 <- read.table(text="
>>> PGID PTID Year Visit  Count
>>> 6755 53121 2009 1 0
>>> 6755 53121 2009 2 0
>>> 6755 53121 2009 3 0
>>> 6755 53122 2008 1 0
>>> 6755 53122 2008 2 0
>>> 6755 53122 2008 3 1
>>> 6755 53122 2009 1 0
>>> 6755 53122 2009 2 1
>>> 6755 53122 2009 3 2", header=TRUE)
>>>
>>>
>>> df2 <- with(df1, sapply(split(df1, list(PTID, Year)),
>>>       function(x) if (nrow(x)) x[which.max(x$Count), ]))
>>> df2 <- do.call(rbind, df2)
>>> rownames(df2) <- 1:nrow(df2)
>>> df2
>>>
>>> Use which.max(), not which().
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>> On 25-07-2012 18:10, kborgmann wrote:
>>>>
>>>> Hi,
>>>> I have a dataset in which I would like to select rows based on matching
>>>> conditions and return the row with the maximum value of a variable, or
>>>> else return a single row if duplicate counts exist.  My dataset looks
>>>> like this:
>>>> PGID    PTID    Year     Visit  Count
>>>> 6755    53121   2009    1       0
>>>> 6755    53121   2009    2       0
>>>> 6755    53121   2009    3       0
>>>> 6755    53122   2008    1       0
>>>> 6755    53122   2008    2       0
>>>> 6755    53122   2008    3       1
>>>> 6755    53122   2009    1       0
>>>> 6755    53122   2009    2       1
>>>> 6755    53122   2009    3       2
>>>>
>>>> I would like to select rows where PTID and Year match and return the row
>>>> with the maximum count, or else return one row if the counts are the
>>>> same, such that I get this output:
>>>> PGID    PTID    Year     Visit  Count
>>>> 6755    53121   2009    1       0
>>>> 6755    53122   2008    3       1
>>>> 6755    53122   2009    3       2
>>>>
>>>> I tried the following code and the output is almost correct, but
>>>> duplicate values were included:
>>>> df2<-with(df, sapply(split(df, list(PTID, Year)),
>>>> function(x) if (nrow(x)) x[which(x$Count==max(x$Count)),]))
>>>> df2<-do.call(rbind,df2)
>>>> rownames(df2)<-1:nrow(df2)
>>>>
>>>> Any suggestions?
>>>> Thanks much for your responses!
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://r.789695.n4.nabble.com/Select-rows-based-on-matching-conditions-and-logical-operators-tp4637809.html
>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>
>



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


