[R] Sampling
Tim Hesterberg
timh at insightful.com
Wed Feb 6 19:49:24 CET 2008
> I want to generate different samples using the
>following code:
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
> How can I specify that I need 12 "A"s and 12 "B"s?
I introduced the concept of "sampling with minimal replacement" into the
S-PLUS version of sample to handle things like this:
sample(LETTERS[1:2], 24, minimal = T)
This is very useful in variance reduction applications, to approximately
stratify but without introducing bias. I'd like to see this in R.
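For the original question, base R can already produce exactly 12 "A"s and 12 "B"s by permuting a balanced vector. The sketch below also includes a small helper for the equal-probability case when size is not a multiple of the number of values. The name sample_minimal is hypothetical; it is not part of base R and is not the S-PLUS implementation, only one way to mimic the idea.

```r
# Exactly 12 "A"s and 12 "B"s, in random order: permute a balanced vector.
g <- sample(rep(LETTERS[1:2], each = 12))
table(g)

# Hypothetical helper (an assumption, not a base R or S-PLUS function):
# equal-probability sampling with minimal replacement.
# Each value appears floor(size/n) or ceiling(size/n) times.
sample_minimal <- function(x, size) {
  n <- length(x)
  full <- rep(x, times = size %/% n)        # complete copies of x
  extra <- x[sample.int(n, size %% n)]      # remainder, drawn without replacement
  all_vals <- c(full, extra)
  all_vals[sample.int(length(all_vals))]    # shuffle the combined draw
}

table(sample_minimal(LETTERS[1:2], 24))     # 12 of each
table(sample_minimal(1:3, 5))               # each value once or twice
```

Indexing with sample.int rather than calling sample(x) directly avoids the usual surprise when x is a single number.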
I'll raise a related issue - sampling with unequal probabilities,
without replacement. R does the wrong thing, in my opinion:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
values
1 2 3
834 574 592
The selection probabilities are not proportional to the specified
probabilities.
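The R counts above can be checked exactly. R's documentation for sample says that without replacement the probabilities are applied sequentially, so the second draw is proportional to the weights among the remaining items. A short enumeration over ordered pairs, written here only for the size = 2 case, gives the resulting inclusion probabilities; the variable names are my own.

```r
prob <- c(0.5, 0.25, 0.25)
size <- 2
n <- length(prob)

# Exact inclusion probabilities for size = 2 under sequential draws:
# P(first = i, second = j) = prob[i] * prob[j] / (1 - prob[i])
incl <- numeric(n)
for (i in seq_len(n)) {
  for (j in seq_len(n)[-i]) {
    p_pair <- prob[i] * prob[j] / (1 - prob[i])
    incl[c(i, j)] <- incl[c(i, j)] + p_pair
  }
}

round(incl, 4)   # 0.8333 0.5833 0.5833, matching the simulated counts per 1000
size * prob      # 1.0 0.5 0.5, what proportional selection would require
```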
In contrast, in S-PLUS:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
1 2 3
1000 501 499
You can specify minimal = FALSE to get the same behavior as R:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25), minimal = F))
> table(values)
1 2 3
844 592 564
There is a reason this is associated with the concept of sampling with
minimal replacement. Consider for example:
sample(1:4, size = 3, prob = 1:4/10)
The expected frequencies of (1,2,3,4) should be proportional to prob, that is,
equal to size*prob = c(.3,.6,.9,1.2). That isn't possible when sampling
without replacement, because no observation can appear more than once in a
sample, so its expected frequency cannot exceed 1. Sampling with minimal
replacement allows this; observation 4 is included in every sample, and is
included twice in 20% of the samples.
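One way to get this behavior in R is systematic sampling on the cumulative expected counts: lay the intervals of length size*prob end to end, pick one uniform random start, and step through at unit spacing. Each observation is then selected either floor(size*prob) or ceiling(size*prob) times, with the right expectation. This is only a sketch under that assumption; the function name sample_minimal_sys is hypothetical, and I am not claiming this is how S-PLUS implements minimal = T.

```r
# Hypothetical helper (an assumption, not the S-PLUS algorithm):
# unequal-probability sampling with minimal replacement via systematic sampling.
sample_minimal_sys <- function(x, size, prob) {
  prob <- prob / sum(prob)
  expected <- size * prob                          # target expected counts
  lower <- c(0, cumsum(expected)[-length(x)])      # left endpoint of each unit's interval
  points <- runif(1) + 0:(size - 1)                # random start, then unit spacing
  x[findInterval(points, lower)]                   # unit whose interval holds each point
}

set.seed(1)
draws <- replicate(2000, sample_minimal_sys(1:4, size = 3, prob = 1:4 / 10))
table(draws) / 2000          # average counts per sample, near c(.3, .6, .9, 1.2)
table(colSums(draws == 4))   # observation 4 appears once or twice, never zero times
```

Because the units are kept in a fixed order, the joint selection probabilities are restricted (adjacent units tend not to be chosen together); permuting the units before each draw is a common variation if that matters.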
Tim Hesterberg
Disclaimer - these are my opinions, not those of my employer.
========================================================
| Tim Hesterberg Senior Research Scientist |
| timh at insightful.com Insightful Corp. |
| (206)802-2319 1700 Westlake Ave. N, Suite 500 |
| (206)283-8691 (fax) Seattle, WA 98109-3044, U.S.A. |
| www.insightful.com/Hesterberg |
========================================================
I'll teach short courses:
Advanced Programming in S-PLUS: San Antonio TX, March 26-27, 2008.
Bootstrap Methods and Permutation Tests: San Antonio, March 28, 2008.