[R] 'simulate.p.value' for goodness of fit

Rolf Turner r.turner at auckland.ac.nz
Thu Jan 17 21:28:57 CET 2008


I'm afraid I can't follow your examples, but you seem to me to be
mixing contingency table tests and goodness of fit tests in a somewhat
incoherent fashion.

Note that your ``x2()'' function does a contingency table test, and not
a goodness of fit test.

Note that in chisq.test(), if ``x'' is a matrix, then the ``p'' argument
is ignored.

For a goodness of fit test, the sampling in chisq.test ***is***  
multinomial.  It
uses the sample() function, as a quick glance at the code would have  
told you.

I haven't the time to plough through your code and figure out what
you're driving at, but I think that part of your problem could be
the degrees of freedom.  Under the contingency table test the
degrees of freedom are 1; under the goodness of fit test the
degrees of freedom are 3.  (The vector of probabilities is
*known* under the g.o.f. test, not estimated.)

		cheers,

			Rolf Turner

On 18/01/2008, at 7:59 AM, Bob wrote:

> R Help on 'chisq.test' states that
>     "if 'simulate.p.value' is 'TRUE', the p-value is computed by Monte
>      Carlo simulation with 'B' replicates.
>
>      In the contingency table case this is done by random sampling  
> from
>      the set of all contingency tables with given marginals, and works
>      only if the marginals are positive...
>
>      In the goodness-of-fit case this is done by random sampling from
>      the discrete distribution specified by 'p', each sample being of
>      size 'n = sum(x)'."
>
> The last paragraph suggests that in the goodness-of-fit case, if p  
> gives the
>
> expected probability for each cell, this random sampling is  
> multinomial.
>
> Unfortunately, as the following examples reveal, the sampling model is
> neither
>
> multinomial nor hypergeometric - at least when it is applied to a 4- 
> fold
> table.
>
> This is rather sad as some people assume that R's chisq.test  
> function can
>
> perform a Monte Carlo test of X-squared, employing a multinomial  
> model - in
>
> other words, assuming that your data are a single random sample.
>
>
> ### Example 1.
>> x=matrix(c(1,2,3,4),nc=2)
>> # To begin with, let us apply the large-sample approximations
>> chisq.test(x,correct=TRUE)$p.value
> [1] 0.6726038
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x,  
> correct = TRUE)
>> chisq.test(x,correct=FALSE)$p.value
> [1] 0.7781597
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x, correct =
> FALSE)
>>
>> # So let us apply a 2-tailed test of O.R.=1, using a  
>> hypergeometric model
>> fisher.test(x)$p.value
> [1] 1
>> # This should also apply a hypergeometric model
>> chisq.test(x,simulate.p.value=TRUE,B=500000)$p.value
> [1] 1
>>
>> # Now we work out the expected probability for each cell
>> p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/sum(x)^2
>> # But this applies a hypergeometric model, presumably because p is  
>> not
>> scalar
>> chisq.test(x,p=p,simulate.p.value=TRUE,B=500000)$p.value
> [1] 1
>> # This seems to do something different,
>> # at any rate it is much slower, and needs more memory
>> chisq.test(x[1:4],p=p[1:4],simulate.p.value=TRUE,B=10000)$p.value
> [1] 1
>> # Which would appear to be using the same model as above
>>
>> # Now let us apply an X2 test using a multinomial model
>> # (The code for this x2.test function is in Appendix 1, below.)
>> x2.test(x,R=200000)
> with cc P =  0.7316812
> conventional-P =  0.8838786
> mid-P =  0.8423058
>>
>> # All of these P-values are higher than those given by the Chi- 
>> squared
>
> approximation,
>> # but they certainly do not equal 1.
>> # But is this is an artefact of our very small sample?
>>
>>
>>
>>
>> ### Example 2.
>> # Let us try a larger sample
>> x=matrix(c(56,35,23,42),nc=2)
>>
>> # First we apply the asymptotic model
>> chisq.test(x,correct=TRUE)$p.value
> [1] 0.002222425
>> chisq.test(x,correct=FALSE)$p.value
> [1] 0.001276595
>>
>> # Now for the hypergeometric (fixed margin totals model)
>> fisher.test(x)$p.value
> [1] 0.001931078
>> chisq.test(x,simulate.p.value=TRUE,B=500000)$p.value
> [1] 0.001913996
>> p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/sum(x)^2
>> chisq.test(x,p=p,simulate.p.value=TRUE,B=500000)$p.value
> [1] 0.001891996
>>
>> Next comes what we had hoped to be a multinomial test
>> chisq.test(x[1:4],p=p[1:4],simulate.p.value=TRUE,B=10000)$p.value
> [1] 0.01639836
>> # This is obviously not the same hypergeometric model as used for  
>> a > #
>
> chi-squared test.
>> # The P-value is about 10x of the approximate tests (above)
>> #  or the exact tests (below).
>>
>> x2.test(x,R=200000)
> with cc P =  0.002059990
> conventional-P =  0.001184994
> mid-P =  0.001172494
>>
>> # Whatever that chi-squared test model IS, it is certainly not
>> multinomial!
>> # Could it possibly be Poisson and, if so, why???
>
>
> ######## Appendix 1:
>
> # We have used these  functions to do a 2x2 multinomial test of X2:
>
> x2=function(y,cc=FALSE){
> y=y*1.;n=sum(y);C=cc*n/2
> a=y[1];b=y[2];c=y[3];d=y[4]
> ab=a+b;cd=c+d;ac=a+c;bd=b+d
> D=ab*cd*ac*bd
> if(D==0)x2=NA else x2=n*(abs(a*d-b*c)-C)^2/D
> x2}
>
> x2.test=function(x,R=5000){
> n=sum(x)
> p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/n/n
> Q=sort(apply(rmultinom(R,n,p),2,x2))
> q=x2(x,cc=TRUE)
> pl=rank(c(q,Q),ties.method='max')[1]/(length(Q)+1)
> pe=sum(c(q,Q)==q)/(length(Q)+1);pu=1-pl+pe
> cat('with cc P = ',pu,'\n')
> q=x2(x)
> pl=rank(c(q,Q),ties.method='max')[1]/(length(Q)+1)
> pe=sum(c(q,Q)==q)/(length(Q)+1);pu=1-pl+pe
> cat('conventional-P = ',pu,'\n')
> cat('mid-P = ',pu-pe/2,'\n')}
>
> Bob
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting- 
> guide.html
> and provide commented, minimal, self-contained, reproducible code.


######################################################################
Attention:\ This e-mail message is privileged and confid...{{dropped:9}}




More information about the R-help mailing list