[BioC] Very low P-values in limma

Gordon K Smyth smyth at wehi.EDU.AU
Fri Oct 23 04:45:01 CEST 2009


Dear Paul,

I'm not quite sure why you're so shocked about getting significant 
results.  Usually people complain to me when they don't get significance 
:).

limma is able to obtain much higher significance levels than a t-test 
because it (i) leverages information from the within-array replicates and 
(ii) borrows information across probes.  These two approaches in 
combination increase the effective degrees of freedom dramatically.

We would hardly go to so much time and trouble to develop new statistical 
methods if they didn't improve on ordinary t-tests.

If I were you, I'd be looking at the sizes of the fold changes, and 
whether the results seem biologically sensible and agree with known 
information, and so on.

Best wishes
Gordon


---------- original message -----------
[BioC] Very low P-values in limma
Paul Geeleher paulgeeleher at gmail.com
Thu Oct 22 11:34:01 CEST 2009

The value of df.prior is 3.208318. Hopefully somebody can help me here
because I'm still really at a loss on why these values are so low.

Here's the script. I'm fairly sure it conforms to the documentation:

library(affy)

setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/')

# create a color pallete for boxplots
cols <- brewer.pal(12, "Set3")

# This defines the column name of the mean Cy3 foreground intensites
#Cy3 <- "F532 Median"
# background corrected value
Cy3 <- "F532 Median - B532"

# This defines the column name of the mean Cy3 background intensites
Cy3b <- "B532 Mean"

# Read the targets file (see limmaUsersGuide for info on how to create 
this)
targets <- readTargets("targets.csv")

# read values from gpr file
# because there is no red channel, read Rb & R to be the same values as G
and Gb
# this is a type of hack that allows the function to work
RG <- read.maimages( targets$FileName,
source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))

# remove the extraneous values red channel values
RG$R <- NULL
RG$Rb <- NULL

# this line of code will remove any negative values which cause errors
further on
RG$G[RG$G<0] <- 0

# create a pData for the data just read
# this indicates which population each array belongs to
# a are "her-" and b are "her+" (almost certain)
pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b'))
rownames(pData) <-  RG$targets$FileName

# create design matrix
design <- model.matrix(~factor(pData$population))

# In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in 
their
name
# so the grep function can be used to extract all of these, removing
# all control signals and printing buffers etc.
# You need to check your .gpr files to find which signals you should
extract.
miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name))
RG.final <- RG[miRs, ]

# load vsn library, this contains the normalization functions
library('vsn')

# Do VSN normalization and output as vns object
mat <- vsnMatrix(RG.final$G)

# insert rownames into matrix with normalized data
# this will mean that the gene names will appear in our final output
rownames(mat at hx) <- RG.final$genes$Name

# my .gpr files contain 4 "within-array replicates" of each miRNA.
# We need to make full use of this information by calculating the 
duplicate
correlation
# in order to use the duplicateCorrelation() function below,
# we need to make sure that the replicates of each gene appear in sequence
in this matrix,
# so we sort the normalized values so the replicate groups of 4 miRs 
appear
in sequence
mat at hx <- mat at hx[order(rownames(mat at hx)), ]

# calculate duplicate correlation between the 4 replicates on each array
corfit <- duplicateCorrelation(mat, design, ndups=4)

# fit the linear model, including info on duplicates and correlation
fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)

# calculate values using ebayes
ebayes <- eBayes(fit)

# output a list of top differnetially expressed genes...
topTable(ebayes, coef = 2, adjust = "BH", n = 10)


-Paul.


On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at wehi.edu.au> wrote:

> Dear Paul:
>
>   The low p-values you have got are not surprising to me. I have got 
even
> lower FDR adjusted p-values than yours. This just means you got
> significantly differentially expressed genes. But on the other hand, I 
did
> see much higher adjusted p-values in some of my microarray analyses. In 
that
> case, I will explore the data in more depth such as looking at batch 
effect
> etc.
>
> Cheers,
> Wei
>
>
> Paul Geeleher wrote:
>
>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering 
about
>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs
>> Group
>> B' experimental design. 4 arrays in one group, 3 in the other and 4
>> duplicate spots for each miRNA on each array.
>>
>> The lowest adjusted p-values in the differential expression analysis 
are
>> in
>> the region of 10^-7.
>>
>> Its been pointed out to me that plugging the point values from each 
sample
>> into a regular t-test you get p=0.008, which then also needs to take 
the
>> multiple test hit. Can anybody explain why limma is giving me such 
lower
>> values and if they are valid?
>>
>> I can provide more information if required.
>>
>> Thanks,
>>
>> Paul.
>>
>>
>>
>


-- 
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland



More information about the Bioconductor mailing list