[BioC] Very low P-values in limma

Wolfgang Huber whuber at embl.de
Thu Oct 22 14:21:53 CEST 2009


Dear Paul

please do not do something like:

  # this line of code will remove any negative values which cause errors
  # further on
  RG$G[RG$G<0] <- 0

First, vsn is intended to work with data that is not mutilated like 
that. There is ample literature on this topic. The vsn vignette is a 
good start.

Second, when you use a parametric test (which is what 
lmFit/eBayes/topTable do), do not arbitrarily replace parts of your data 
with phantasy values, it can invalidate the assumptions underlying the 
p-value computation.

I am not sure whether this already fixes all your problems, but this is 
certainly one of them.

	Best wishes
	Wolfgang



Paul Geeleher ha scritto:
> The value of df.prior is 3.208318. Hopefully somebody can help me here
> because I'm still really at a loss on why these values are so low.
> 
> Here's the script. I'm fairly sure it conforms to the documentation:
> 
> library(affy)
> 
> setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/')
> 
> # create a color pallete for boxplots
> cols <- brewer.pal(12, "Set3")
> 
> # This defines the column name of the mean Cy3 foreground intensites
> #Cy3 <- "F532 Median"
> # background corrected value
> Cy3 <- "F532 Median - B532"
> 
> # This defines the column name of the mean Cy3 background intensites
> Cy3b <- "B532 Mean"
> 
> # Read the targets file (see limmaUsersGuide for info on how to create this)
> targets <- readTargets("targets.csv")
> 
> # read values from gpr file
> # because there is no red channel, read Rb & R to be the same values as G
> and Gb
> # this is a type of hack that allows the function to work
> RG <- read.maimages( targets$FileName,
> source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))
> 
> # remove the extraneous values red channel values
> RG$R <- NULL
> RG$Rb <- NULL
> 
> # this line of code will remove any negative values which cause errors
> further on
> RG$G[RG$G<0] <- 0
> 
> # create a pData for the data just read
> # this indicates which population each array belongs to
> # a are "her-" and b are "her+" (almost certain)
> pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b'))
> rownames(pData) <-  RG$targets$FileName
> 
> # create design matrix
> design <- model.matrix(~factor(pData$population))
> 
> # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their
> name
> # so the grep function can be used to extract all of these, removing
> # all control signals and printing buffers etc.
> # You need to check your .gpr files to find which signals you should
> extract.
> miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name))
> RG.final <- RG[miRs, ]
> 
> # load vsn library, this contains the normalization functions
> library('vsn')
> 
> # Do VSN normalization and output as vns object
> mat <- vsnMatrix(RG.final$G)
> 
> # insert rownames into matrix with normalized data
> # this will mean that the gene names will appear in our final output
> rownames(mat at hx) <- RG.final$genes$Name
> 
> # my .gpr files contain 4 "within-array replicates" of each miRNA.
> # We need to make full use of this information by calculating the duplicate
> correlation
> # in order to use the duplicateCorrelation() function below,
> # we need to make sure that the replicates of each gene appear in sequence
> in this matrix,
> # so we sort the normalized values so the replicate groups of 4 miRs appear
> in sequence
> mat at hx <- mat at hx[order(rownames(mat at hx)), ]
> 
> # calculate duplicate correlation between the 4 replicates on each array
> corfit <- duplicateCorrelation(mat, design, ndups=4)
> 
> # fit the linear model, including info on duplicates and correlation
> fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)
> 
> # calculate values using ebayes
> ebayes <- eBayes(fit)
> 
> # output a list of top differnetially expressed genes...
> topTable(ebayes, coef = 2, adjust = "BH", n = 10)
> 
> 
> -Paul.
> 
> 
> On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at wehi.edu.au> wrote:
> 
>> Dear Paul:
>>
>>   The low p-values you have got are not surprising to me. I have got even
>> lower FDR adjusted p-values than yours. This just means you got
>> significantly differentially expressed genes. But on the other hand, I did
>> see much higher adjusted p-values in some of my microarray analyses. In that
>> case, I will explore the data in more depth such as looking at batch effect
>> etc.
>>
>> Cheers,
>> Wei
>>
>>
>> Paul Geeleher wrote:
>>
>>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering about
>>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs
>>> Group
>>> B' experimental design. 4 arrays in one group, 3 in the other and 4
>>> duplicate spots for each miRNA on each array.
>>>
>>> The lowest adjusted p-values in the differential expression analysis are
>>> in
>>> the region of 10^-7.
>>>
>>> Its been pointed out to me that plugging the point values from each sample
>>> into a regular t-test you get p=0.008, which then also needs to take the
>>> multiple test hit. Can anybody explain why limma is giving me such lower
>>> values and if they are valid?
>>>
>>> I can provide more information if required.
>>>
>>> Thanks,
>>>
>>> Paul.
>>>
>>>
>>>
> 
> 

-- 

Best wishes
      Wolfgang

-------------------------------------------------------
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber



More information about the Bioconductor mailing list