[BioC] limma multiple groups comparison produces different pvalue comparing with two group comparsion

Wed Mar 14 14:12:34 CET 2012

Hi Chunxuan,

On 3/13/2012 4:53 AM, Shao, Chunxuan wrote:
> Hi everyone:
>
> I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion.
>
> A simple example:
>
> ## multiple groups
> sd<- 0.3*sqrt(4/rchisq(100,df=4))
> y2<- matrix(rnorm(100*9,sd=sd),100,9)
> rownames(y2)<- paste("Gene",1:100)
> y2[1:2,4:6]<- y2[1:2,4:6] + 2
> y2[1:2,7:9]<- y2[1:2,7:9] + 2
> f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C"))
> design<- model.matrix(~0+f)
> colnames(design)<- c("A","B","C")
> fit2<- lmFit(y2,design)
> contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design)
> fit2<- contrasts.fit(fit2, contrast.matrix)
> fit2<- eBayes(fit2)
> topTable(fit2, adjust="BH",coef=1)
>
>          ID      logFC         t      P.Value   adj.P.Val          B
> 1   Gene 1  1.8961975  8.697056 1.038672e-05 0.001038672  3.9011567
> 2   Gene 2  2.3140816  6.104506 1.690945e-04 0.008454727  0.9859443
> 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088
> 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662
> 86 Gene 86  0.7288518  2.726366 2.310059e-02 0.462011867 -4.1198456
> 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985
> 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738
> 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326
> 8   Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289
> 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387
>
> ## two groups
> y<- y2[,1:6]
> design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
> options(digit=3)
> fit<- lmFit(y,design)
> fit<- eBayes(fit)
> topTable(fit, adjust="BH",coef= "Grp2vs1")
>
>          ID      logFC         t      P.Value  adj.P.Val          B
> 1   Gene 1  1.8961975  7.962269 0.0001316933 0.01316933  1.6748248
> 2   Gene 2  2.3140816  6.190071 0.0005783671 0.02891836  0.1114107
> 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052
> 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362
> 86 Gene 86  0.7288518  3.503703 0.0110265957 0.22053191 -3.0478852
> 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279
> 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844
> 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531
> 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241
> 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760
>
>
> In this case,  in the multiple group p_value is much smaller than the two group.
> Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups?

Technically either way is appropriate, but the first case will be more 
powerful. In the second case you are doing a conventional t-test, where 
the denominator is the standard error of the mean (SEM), computed from 
the two groups under consideration. In the first case you are fitting a 
contrast in the context of a larger linear model, in which case the 
denominator is the sums of squares of error (SSE), which is an estimate 
of the intra-group variability over all groups.

Both the SEM and SSE are measuring the same thing, which is the within 
group variability. Measures of variability tend to be inefficient 
statistics, which means that the accuracy of the statistic is highly 
dependent on the number of observations (and they can be quite 
inaccurate with few observations). This is the underlying premise of the 
eBayes step in limma, which uses the variability estimate over all genes 
(which will be pretty accurate) to adjust the variability estimate of an 
individual gene (which may not be accurate, especially with few samples).

In your case, the SSE computed over three groups will tend to be a more 
accurate (and generally smaller) measure of the within group variability 
than the SEM computed over two groups. Since the numerator of your 
statistic is the same in both cases, computing a smaller denominator 
will increase the t-statistic, which will result in smaller p-values, 
and hence more power to detect differences.

Best,

Jim

>
> Thanks in advance.
>
> --
> Chunxuan
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099