[BioC] Changing results depending on context

Tue Jun 12 15:24:07 CEST 2012

Hi January,

When you analyze all data together, the denominator of your t-statistic 
is based on an average of the intra-group variability. In general this 
will tend to make an analysis more powerful, because you are using more 
data to compute the variance estimate (which is not a very efficient 
statistic, so more data tend to improve the estimate).

Conversely, when you analyze the two groups separately, the denominator 
of your t-statistic is based only on the intra-group variance for the 
two groups under consideration.

Without having you data in hand, I can only guess what is going on. 
However, to me the most likely cause of this is a high variance in one 
or more of your 'A' groups. This would explain why when you combine the 
analyses you get fewer genes in the 'B' group and more genes in the 'A' 
group, and vice versa when you do things separately.

There are several ways to see if you have one or more problem chips in 
the A group. You could look at MA plots using plotMA(), you could use 
arrayWeights() or arrayWeightsSimple() to see if some of the arrays get 
severely down-weighted. You could also do a PCA plot of the normalized M 
values to look at the overall grouping structure.

Best,

Jim

On 6/12/2012 7:15 AM, January Weiner wrote:
> Dear all,
>
> I am evaluating a collection of two-color arrays. There are several
> sets corresponding to different tissues, there is a treatment (common
> for each tissue), and the treatment effect is analysed independently
> in each of the tissues. I am analysing the data using limma.
>
> The results in terms of the number of significant genes vary widely
> depending whether the tissues are analysed separately or jointly.
> Since the matter is even a bit more complicated, I include only
> pseudocode below. I guess that this depends on the background
> variation estimation in eBayes, but the trend is not consistent, e.g.
> for one tissue, more (many more!) significant results are found when
> analysed separately, and for another tissue the results are much
> "better" (ie. more singificant results) if analysed along other data.
>
>    targets<- readTargets( "targets.txt" )
> # Agilent two-color microarrays
>    rg.1<- read.maimages( targets, columns= list( G= "gMedianSignal",
> Gb=  "gBGMedianSignal", R="rMedianSignal",   Rb = "rBGMedianSignal"),
> annotation = c("Row", "Col","FeatureNum", "ControlType","ProbeName",
> "GeneName", "SystematicName", "Description" ))
>    rg.2<- backgroundCorrect( rg.1, method= "normexp", offset=50 )
>    rg.3<- normalizeWithinArrays( rg.2, method= "loess" )
>    rg<- normalizeBetweenArrays( rg.3, method= "quantile" )
>    rg<- rg[ rg$genes$ControlType == 0, ]
>
> # analysing all data together:
>    t2<- targetsA2C( targets )
>    design<- model.matrix( ~ 0 + t2$Target )
>    colnames( design )<- levels( t2$Target )
>    corfit<- intraspotCorrelation( rg, design )
>    fit<- lmscFit( rg, design, correlation= corfit$consensus.correlation )
>    cmtx<- makeContrasts( "A.T1-A.T2", "B.T1-B.T2", levels= design )
>    fit<- contrasts.fit( fit, cmtx )
>    fit<- eBayes( fit )
>
> Above, A and B are tissues, T1 and T2 are treatments.
>
> Here, some exemplary results showing the number of significant genes:
>
> adj.P.Val<  0.05:
> A.T1-A.T2: 5601
> B.T1-B.T2: 3914
>
> adj.P.Val<  1e-5
> A.T1-A.T2: 672
> B.T1-B.T2: 758
>
> Now, I analyse the data separately:
>
> A.targets<- targets[ 1:16, ]
> A.rg<- rg[ 1:16, ]
> A.t2<- targetsA2C( A.targets )
> A.design<- model.matrix( ~ 0 + A.t2$Target )
> colnames( A.design )<- levels( A.t2$Target )
> A.corfit<- intraspotCorrelation( A.rg, A.design )
> A.fit<- lmscFit( A.rg, A.design, correlation= A.corfit$consensus.correlation )
> A.cmtx<- makeContrasts( "A.T1-A.T2", levels= A.design )
> A.fit<- contrasts.fit( A.fit, cmtx )
> A.fit<- eBayes( A.fit )
>
>
> B.targets<- targets[ 17:32, ]
> B.rg<- rg[ 17:32, ]
> B.t2<- targetsA2C( B.targets )
> B.design<- model.matrix( ~ 0 + B.t2$Target )
> colnames( B.design )<- levels( B.t2$Target )
> B.corfit<- intraspotCorrelation( B.rg, B.design )
> B.fit<- lmscFit( B.rg, B.design, correlation= B.corfit$consensus.correlation )
> B.cmtx<- makeContrasts( "B.T1-B.T2", levels= B.design )
> B.fit<- contrasts.fit( B.fit, cmtx )
> B.fit<- eBayes( B.fit )
>
> Numbers of significant data are now much, much different:
>
> adj.P.Val<  0.05:
> A.T1-A.T2: 3347
> B.T1-B.T2: 5443
>
> adj.P.Val<  1e-5
> A.T1-A.T2: 108
> B.T1-B.T2: 1102
>
> six times less for tissue A, but almost 50% more for tissue B! How can this be?
>
> The log fold changes are stable (i.e., they do not change between the
> various sets), which is to be expected -- changing the context
> influences the moderated t statistics, but not the estimation of log
> fold change. However, the p-values are not simply smaller or larger,
> there is little correlation of the p-values from one context to
> another. I'm a bit lost in here.
>
> Kind regards,
>
> January
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099