[R] Sampling
Thomas Lumley
tlumley at u.washington.edu
Thu Feb 7 18:34:08 CET 2008
On Wed, 6 Feb 2008, Tim Hesterberg wrote:
>> Tim Hesterberg wrote:
>>> I'll raise a related issue - sampling with unequal probabilities,
>>> without replacement. R does the wrong thing, in my opinion:
>>> ...
>> Peter Dalgaard wrote:
>> But is that the right thing? ...
> (See bottom for more of the previous messages.)
>
>
> First, consider the common case, where size * max(prob) < 1 --
> sampling with unequal probabilities without replacement.
>
> Why do people do sampling with unequal probabilities, without
> replacement? A typical application would be sampling with probability
> proportional to size, or more generally where the desire is that
> selection probabilities match some criterion.
In real survey PPS sampling it also matters what the pairwise joint
selection probabilities are -- and there are *many* algorithms, with
different properties. Yves Tillé has written an R package that implements
some of them, and the pps package implements others.
> The default S-PLUS algorithm does that. The selection probabilities
> at each of step 1, 2, ..., size are all equal to prob, and the overall
> probabilities of selection are size*prob.
Umm, no, they aren't.
S-PLUS 7.0.3 doesn't say explicitly what its algorithm is, but is happy to
take a sample of size 10 from a population of size 10 with unequal
sampling probabilities. The overall selection probability *can't* be
anything other than 1 for each element -- sampling without replacement and
proportional to any other set of probabilities is impossible.
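The size-10 argument can be checked in R, whose sample() also accepts such a call. This is only a sketch: the weights 1:10 and the 200 replications are illustrative assumptions, and it exercises R rather than S-PLUS.

```r
# Draw all 10 elements without replacement, with unequal weights.
# The weights 1:10 are an illustrative assumption.
set.seed(1)
draws <- replicate(200, sample(1:10, size = 10, prob = 1:10))

# Each column of draws is a permutation of 1:10, so every element appears
# in every sample. The weights can only change the order of selection.
incl <- rowMeans(apply(draws, 2, function(s) 1:10 %in% s))
print(incl)
```

The inclusion frequency is the same for every element whatever is passed as prob, which is the sense in which proportional selection is impossible here.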
Even in a milder case -- samples of size 5 from 1:10 with probabilities
proportional to 1:10 -- the deviation is noticeable in 1000 replications.
In this case sampling with the specified probabilities is actually
possible, but S-PLUS doesn't do it.
Now, it might be useful to add another replace=FALSE sampler to sample(),
such as the newish Conditional Poisson Sampler based on the work of
S.X.Chen. This does give correct marginal probabilities of inclusion, and
the pairwise joint probabilities are not too hard to compute.
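For readers unfamiliar with the idea, conditional Poisson sampling is independent (Poisson) sampling conditioned on getting exactly the required sample size. A minimal rejection sketch follows; cps_reject is a hypothetical helper, not a function from any package, and the vector p holds working probabilities that are in general not equal to the resulting inclusion probabilities. Calibrating p so that the inclusion probabilities hit a target is the harder step and is not attempted here.

```r
# Rejection sketch of conditional Poisson sampling (hypothetical helper).
# p: working Bernoulli probabilities, NOT the resulting inclusion probabilities.
# n: required sample size.
cps_reject <- function(p, n, max_tries = 1e5) {
  for (i in seq_len(max_tries)) {
    take <- runif(length(p)) < p          # independent Bernoulli draws
    if (sum(take) == n) return(which(take))  # keep only size-n outcomes
  }
  stop("no sample of the requested size within max_tries")
}

set.seed(7)
p <- 5 * (1:10) / 55   # illustrative working probabilities, summing to 5
s <- cps_reject(p, n = 5)
print(s)
```

Rejection is wasteful when the accepted size is unlikely; it is used here only because it makes the definition easy to see.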
I don't think that dropping the current sequential PPS implementation is
a good idea. The help page does explain the algorithm, though it might be
useful to add an explicit note that the marginal probabilities of sampling
are not the supplied probabilities.
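The gap between the supplied and the realised probabilities is easy to see by simulation. The sketch below uses the same setup as the milder case above (samples of size 5 from 1:10, weights 1:10) with R's sample(); the 20000 replications and the seed are arbitrary choices made to keep the simulation noise small.

```r
set.seed(42)
n_rep <- 20000
pop <- 1:10
w <- 1:10
n <- 5

# Inclusion probabilities if selection were exactly proportional to w.
# The largest is 50/55 < 1, so this target is attainable in principle.
target <- n * w / sum(w)

# Empirical inclusion frequencies under sample()'s sequential scheme.
hits <- replicate(n_rep, pop %in% sample(pop, size = n, prob = w))
observed <- rowMeans(hits)

print(round(cbind(target, observed), 3))
```

Under sequential selection the heavily weighted elements are included less often than the target and the lightly weighted ones more often, because removing each selected element shifts the remaining weights.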
-thomas
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle