[R] Subsetting problem data, 2
arun
smartpink111 at yahoo.com
Fri Jul 20 13:50:46 CEST 2012
Hi,
Just a question regarding the dataset.
Suppose I include two more patients, F and G, with different missing values, as in this new dataset, and run the code.
dat1<-read.table(text="
Patient Cycle V1 V2 V3 V4 V5
A 1 0.4 0.1 0.5 1.5 NA
A 2 0.3 0.2 0.5 1.6 NA
A 3 0.3 NA 0.6 1.7 NA
A 4 0.4 NA 0.4 1.8 NA
A 5 0.5 0.2 0.5 1.5 NA
B 1 0.4 NA NA NA NA
B 2 0.4 NA NA NA NA
C 1 0.9 0.9 0.9 NA NA
C 3 0.3 0.5 0.6 NA NA
C 4 NA NA NA NA NA
C 5 0.4 NA NA NA NA
D 1 0.2 0.5 NA NA NA
D 2 0.5 0.7 NA NA NA
D 4 0.6 0.4 NA NA NA
D 5 0.5 0.5 NA NA NA
E 1 0.1 NA NA NA NA
E 2 0.5 0.3 NA NA NA
E 3 0.4 0.3 NA NA NA
F 1 0.2 NA 0.2 0.5 0.1
F 2 0.5 NA 0.4 NA 0.3
F 3 0.6 NA NA 0.3 0.2
G 1 0.2 0.5 NA 0.5 0.2
G 3 0.4 0.3 0.4 NA 0.3
G 4 0.6 0.2 0.2 0.4 NA
",sep="",header=TRUE)
nms <- names(dat1)[grep("^V[1-9]$", names(dat1))]
dd <- split(dat1, dat1$Patient)
fun <- function(x) any(is.na(x)) && any(!is.na(x))
ix <- sapply(dd, function(x) Reduce(`|`, lapply(x[, nms], fun)))
dd[ix]
do.call(rbind, dd[ix])
Patient Cycle V1 V2 V3 V4 V5
A.1 A 1 0.4 0.1 0.5 1.5 NA
A.2 A 2 0.3 0.2 0.5 1.6 NA
A.3 A 3 0.3 NA 0.6 1.7 NA
A.4 A 4 0.4 NA 0.4 1.8 NA
A.5 A 5 0.5 0.2 0.5 1.5 NA
C.8 C 1 0.9 0.9 0.9 NA NA
C.9 C 3 0.3 0.5 0.6 NA NA
C.10 C 4 NA NA NA NA NA
C.11 C 5 0.4 NA NA NA NA
E.16 E 1 0.1 NA NA NA NA
E.17 E 2 0.5 0.3 NA NA NA
E.18 E 3 0.4 0.3 NA NA NA
F.19 F 1 0.2 NA 0.2 0.5 0.1
F.20 F 2 0.5 NA 0.4 NA 0.3
F.21 F 3 0.6 NA NA 0.3 0.2
G.22 G 1 0.2 0.5 NA 0.5 0.2
G.23 G 3 0.4 0.3 0.4 NA 0.3
G.24 G 4 0.6 0.2 0.2 0.4 NA
Then patients F and G are included in the list. But according to your initial statement, V1 and V2 are the most important variables. If B is not included in the list because its values are missing in both of its cycles, do you think F or G should be included in the list? The only difference is that F and G have missing values in other variables, which do not behave consistently. Do you have situations like that?
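To make the question concrete, here is a minimal sketch that prints, for each patient, which variables are partly missing, and then repeats the selection using only V1 and V2. The names partly_missing, flags, key_vars and ix_key are mine, and restricting the check to V1 and V2 is only an assumption about what you might want, not something you asked for.

```r
# Same data as dat1 above, re-read so the sketch is self-contained
dat1 <- read.table(text = "
Patient Cycle V1 V2 V3 V4 V5
A 1 0.4 0.1 0.5 1.5 NA
A 2 0.3 0.2 0.5 1.6 NA
A 3 0.3 NA 0.6 1.7 NA
A 4 0.4 NA 0.4 1.8 NA
A 5 0.5 0.2 0.5 1.5 NA
B 1 0.4 NA NA NA NA
B 2 0.4 NA NA NA NA
C 1 0.9 0.9 0.9 NA NA
C 3 0.3 0.5 0.6 NA NA
C 4 NA NA NA NA NA
C 5 0.4 NA NA NA NA
D 1 0.2 0.5 NA NA NA
D 2 0.5 0.7 NA NA NA
D 4 0.6 0.4 NA NA NA
D 5 0.5 0.5 NA NA NA
E 1 0.1 NA NA NA NA
E 2 0.5 0.3 NA NA NA
E 3 0.4 0.3 NA NA NA
F 1 0.2 NA 0.2 0.5 0.1
F 2 0.5 NA 0.4 NA 0.3
F 3 0.6 NA NA 0.3 0.2
G 1 0.2 0.5 NA 0.5 0.2
G 3 0.4 0.3 0.4 NA 0.3
G 4 0.6 0.2 0.2 0.4 NA
", header = TRUE)

# TRUE when a column is partly missing: some NA and some observed values
partly_missing <- function(x) any(is.na(x)) && any(!is.na(x))

all_vars <- grep("^V[1-9]$", names(dat1), value = TRUE)
dd <- split(dat1, dat1$Patient)

# Patient-by-variable matrix of flags: shows which variable triggers inclusion
flags <- t(sapply(dd, function(d) sapply(d[all_vars], partly_missing)))
flags

# Assumed restriction (hypothetical): only V1 and V2 decide inclusion
key_vars <- c("V1", "V2")
ix_key <- apply(flags[, key_vars, drop = FALSE], 1, any)
names(ix_key)[ix_key]   # "A" "C" "E"

do.call(rbind, dd[ix_key])
```

In the flags matrix, F and G are TRUE only in V3, V4 or V5, so they drop out once the check is limited to V1 and V2.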
A.K.
----- Original Message -----
From: Lib Gray <libgray3827 at gmail.com>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: r-help <r-help at r-project.org>
Sent: Thursday, July 19, 2012 8:17 PM
Subject: Re: [R] Subsetting problem data, 2
I'm still getting the message (if this is what you were suggesting I try).
The data set I'm using has many more columns other than these variables;
could that be a problem? I didn't think it would affect it.
>pattern <- "L[1-8][12]"
> nms<-names(data)[grep(vars,names(data))]
Warning message:
In grep(vars, names(data)) :
argument 'pattern' has length > 1 and only the first element will be used
>
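The warning appears because grep() was still given the character vector vars rather than the single string pattern. Below is a minimal sketch of two ways around it, assuming the columns are named L11, L12, ..., L81, L82 as described. The data frame and its extra columns (XL11, Age) are made up for illustration, and the ^ and $ anchors are my addition to the suggested pattern.

```r
# Hypothetical data frame: the 16 L-variables plus unrelated columns
vars <- paste0("L", rep(1:8, each = 2), 1:2)   # "L11" "L12" "L21" ... "L82"
data <- as.data.frame(matrix(NA_real_, nrow = 2, ncol = 20,
  dimnames = list(NULL, c("Patient", "Cycle", vars, "XL11", "Age"))))

# Option 1: a single regular expression, anchored so "XL11" is not matched
pattern <- "^L[1-8][12]$"
nms_regex <- grep(pattern, names(data), value = TRUE)

# Option 2: exact matching against the vector of names, no regex involved
nms_exact <- names(data)[names(data) %in% vars]

# Both approaches select the same 16 columns
identical(nms_regex, nms_exact)

# Without the anchors the pattern also picks up "XL11"
grep("L[1-8][12]", names(data), value = TRUE)
```

Extra columns in the data set should not matter with either option, as long as none of them has a name of the exact form L, one digit 1-8, one digit 1-2.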
On Thu, Jul 19, 2012 at 6:55 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> Hello,
>
> Sorry, forgot about that. It's trickier to write code without a dataset to
> test it.
>
> Try
>
> pattern <- "L[1-8][12]"
>
> and after the grep print nms to see if it's right.
>
> Rui Barradas
>
> On 20-07-2012 00:33, Lib Gray wrote:
>
>> I'm getting this error message:
>>
>> nms<-names(data)[grep(vars,names(data))]
>> Warning message:
>> In grep(vars, names(data)) :
>> argument 'pattern' has length > 1 and only the first element will be
>> used
>>
>> Is there a way around this?
>>
>>
>> On Thu, Jul 19, 2012 at 6:17 PM, Rui Barradas <ruipbarradas at sapo.pt>
>> wrote:
>>
>>> Hello,
>>>
>>> I guess so, and I can save you some typing.
>>>
>>> vars <- sort(apply(expand.grid("L", 1:8, 1:2), 1, paste, collapse=""))
>>>
>>>
>>> Then use it and see the result.
>>>
>>> Rui Barradas
>>>
>>> On 20-07-2012 00:00, Lib Gray wrote:
>>>
>>>> The variables are actually L11, L12, L21, L22, ... , L81, L82. Would just
>>>> creating a vector c(L11,... ,L82) be fine? (I'm about to try it, but I
>>>> wanted to check to see if that was going to be a big issue).
>>>>
>>>> On Thu, Jul 19, 2012 at 3:27 PM, Rui Barradas <ruipbarradas at sapo.pt>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Try the following. The data is your example of Patient A through E, but
>>>>> from the output of dput().
>>>>>
>>>>> dat <- structure(list(Patient = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
>>>>> 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("A",
>>>>> "B", "C", "D", "E"), class = "factor"), Cycle = c(1L, 2L, 3L,
>>>>> 4L, 5L, 1L, 2L, 1L, 3L, 4L, 5L, 1L, 2L, 4L, 5L, 1L, 2L, 3L),
>>>>> V1 = c(0.4, 0.3, 0.3, 0.4, 0.5, 0.4, 0.4, 0.9, 0.3, NA, 0.4,
>>>>> 0.2, 0.5, 0.6, 0.5, 0.1, 0.5, 0.4), V2 = c(0.1, 0.2, NA,
>>>>> NA, 0.2, NA, NA, 0.9, 0.5, NA, NA, 0.5, 0.7, 0.4, 0.5, NA,
>>>>> 0.3, 0.3), V3 = c(0.5, 0.5, 0.6, 0.4, 0.5, NA, NA, 0.9, 0.6,
>>>>> NA, NA, NA, NA, NA, NA, NA, NA, NA), V4 = c(1.5, 1.6, 1.7,
>>>>> 1.8, 1.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>> NA), V5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>> NA, NA, NA, NA, NA, NA)), .Names = c("Patient", "Cycle",
>>>>> "V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
>>>>> -18L))
>>>>>
>>>>> dat
>>>>>
>>>>> nms <- names(dat)[grep("^V[1-9]$", names(dat))]
>>>>> dd <- split(dat, dat$Patient)
>>>>> fun <- function(x) any(is.na(x)) && any(!is.na(x))
>>>>> ix <- sapply(dd, function(x) Reduce(`|`, lapply(x[, nms], fun)))
>>>>>
>>>>> dd[ix]
>>>>> do.call(rbind, dd[ix])
>>>>>
>>>>>
>>>>> I'm assuming that the variable names are as posted, V followed by a
>>>>> single digit 1-9. To keep the patients with complete cases, just negate
>>>>> the index 'ix'; it's a logical index.
>>>>> Note also that dput() is the best way of posting a data example.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Rui Barradas
>>>>>
>>>>> On 19-07-2012 15:15, Lib Gray wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I didn't give enough information when I sent a query before, so I'm
>>>>>> trying again with a more detailed explanation:
>>>>>>
>>>>>> In this data set, each patient has a different number of measured
>>>>>> variables (they represent tumors, so some people had 2 tumors, some had
>>>>>> 5, etc). The problem I have is that often in later cycles for a patient,
>>>>>> tumors that were originally measured are now missing (or a "new" tumor
>>>>>> showed up). We assume there are many different reasons for why a tumor
>>>>>> would be measured in one cycle and not another, and so I want to subset
>>>>>> OUT the "problem" patients to better study these patterns.
>>>>>>
>>>>>> An example:
>>>>>>
>>>>>> Patient Cycle V1 V2 V3 V4 V5
>>>>>> A 1 0.4 0.1 0.5 1.5 NA
>>>>>> A 2 0.3 0.2 0.5 1.6 NA
>>>>>> A 3 0.3 NA 0.6 1.7 NA
>>>>>> A 4 0.4 NA 0.4 1.8 NA
>>>>>> A 5 0.5 0.2 0.5 1.5 NA
>>>>>>
>>>>>> I want to keep patient A; they have 4 measured tumors, but tumor 2 is
>>>>>> missing data for cycles 3 and 4
>>>>>>
>>>>>> B 1 0.4 NA NA NA NA
>>>>>> B 2 0.4 NA NA NA NA
>>>>>>
>>>>>> I do not want to keep patient B; they have 1 tumor that is measured
>>>>>> consistently in both cycles
>>>>>>
>>>>>> C 1 0.9 0.9 0.9 NA NA
>>>>>> C 3 0.3 0.5 0.6 NA NA
>>>>>> C 4 NA NA NA NA NA
>>>>>> C 5 0.4 NA NA NA NA
>>>>>>
>>>>>> I do want to keep patient C; all their data is missing for cycle 4 and
>>>>>> cycle 5 only measured one tumor
>>>>>>
>>>>>> D 1 0.2 0.5 NA NA NA
>>>>>> D 2 0.5 0.7 NA NA NA
>>>>>> D 4 0.6 0.4 NA NA NA
>>>>>> D 5 0.5 0.5 NA NA NA
>>>>>>
>>>>>> I do not want patient D, their two tumors were measured each cycle
>>>>>>
>>>>>> E 1 0.1 NA NA NA NA
>>>>>> E 2 0.5 0.3 NA NA NA
>>>>>> E 3 0.4 0.3 NA NA NA
>>>>>>
>>>>>> I DO want patient E; they only had one tumor register in Cycle 1, but
>>>>>> cycles 2 and 3 had two tumors.
>>>>>>
>>>>>>
>>>>>> Thanks for any help!
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>>>
>
[[alternative HTML version deleted]]