[BioC] limma multiple groups comparison produces different pvalue comparing with two group comparsion
James W. MacDonald
jmacdon at uw.edu
Wed Mar 14 14:12:34 CET 2012
Hi Chunxuan,
On 3/13/2012 4:53 AM, Shao, Chunxuan wrote:
> Hi everyone:
>
> I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion.
>
> A simple example:
>
> ## multiple groups
> sd<- 0.3*sqrt(4/rchisq(100,df=4))
> y2<- matrix(rnorm(100*9,sd=sd),100,9)
> rownames(y2)<- paste("Gene",1:100)
> y2[1:2,4:6]<- y2[1:2,4:6] + 2
> y2[1:2,7:9]<- y2[1:2,7:9] + 2
> f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C"))
> design<- model.matrix(~0+f)
> colnames(design)<- c("A","B","C")
> fit2<- lmFit(y2,design)
> contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design)
> fit2<- contrasts.fit(fit2, contrast.matrix)
> fit2<- eBayes(fit2)
> topTable(fit2, adjust="BH",coef=1)
>
> ID logFC t P.Value adj.P.Val B
> 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567
> 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443
> 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088
> 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662
> 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456
> 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985
> 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738
> 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326
> 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289
> 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387
>
> ## two groups
> y<- y2[,1:6]
> design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
> options(digit=3)
> fit<- lmFit(y,design)
> fit<- eBayes(fit)
> topTable(fit, adjust="BH",coef= "Grp2vs1")
>
> ID logFC t P.Value adj.P.Val B
> 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248
> 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107
> 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052
> 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362
> 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852
> 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279
> 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844
> 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531
> 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241
> 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760
>
>
> In this case, in the multiple group p_value is much smaller than the two group.
> Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups?
Technically either way is appropriate, but the first case will be more
powerful. In the second case you are doing a conventional t-test, where
the denominator is the standard error of the mean (SEM), computed from
the two groups under consideration. In the first case you are fitting a
contrast in the context of a larger linear model, in which case the
denominator is the sums of squares of error (SSE), which is an estimate
of the intra-group variability over all groups.
Both the SEM and SSE are measuring the same thing, which is the within
group variability. Measures of variability tend to be inefficient
statistics, which means that the accuracy of the statistic is highly
dependent on the number of observations (and they can be quite
inaccurate with few observations). This is the underlying premise of the
eBayes step in limma, which uses the variability estimate over all genes
(which will be pretty accurate) to adjust the variability estimate of an
individual gene (which may not be accurate, especially with few samples).
In your case, the SSE computed over three groups will tend to be a more
accurate (and generally smaller) measure of the within group variability
than the SEM computed over two groups. Since the numerator of your
statistic is the same in both cases, computing a smaller denominator
will increase the t-statistic, which will result in smaller p-values,
and hence more power to detect differences.
Best,
Jim
>
> Thanks in advance.
>
> --
> Chunxuan
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list