[BioC] Very low P-values in limma

Thu Oct 22 15:22:32 CEST 2009

Hi Paul

how does

hist(topTable(ebayes, coef=2, adjust="BH", n=nrow(ebayes))$P.Value,
      breaks=100)

look like?

	Best wishes
	Wolfgang

Paul Geeleher ha scritto:
> Hi Wolfgang,
> 
> Thanks alot for your reply, I removed the offending line and the 
> p-values are still much the same. Actually the reason I had that line in 
> was that I generated a plot (PCA I think) that wouldn't accept negative 
> values.
> 
> -Paul.
> 
> 
> 
> On Thu, Oct 22, 2009 at 1:21 PM, Wolfgang Huber <whuber at embl.de 
> <mailto:whuber at embl.de>> wrote:
> 
>     Dear Paul
> 
>     please do not do something like:
> 
> 
>      # this line of code will remove any negative values which cause errors
>      # further on
>      RG$G[RG$G<0] <- 0
> 
>     First, vsn is intended to work with data that is not mutilated like
>     that. There is ample literature on this topic. The vsn vignette is a
>     good start.
> 
>     Second, when you use a parametric test (which is what
>     lmFit/eBayes/topTable do), do not arbitrarily replace parts of your
>     data with phantasy values, it can invalidate the assumptions
>     underlying the p-value computation.
> 
>     I am not sure whether this already fixes all your problems, but this
>     is certainly one of them.
> 
>            Best wishes
>            Wolfgang
> 
> 
> 
>     Paul Geeleher ha scritto:
> 
>         The value of df.prior is 3.208318. Hopefully somebody can help
>         me here
>         because I'm still really at a loss on why these values are so low.
> 
>         Here's the script. I'm fairly sure it conforms to the documentation:
> 
>         library(affy)
> 
>         setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/')
> 
>         # create a color pallete for boxplots
>         cols <- brewer.pal(12, "Set3")
> 
>         # This defines the column name of the mean Cy3 foreground intensites
>         #Cy3 <- "F532 Median"
>         # background corrected value
>         Cy3 <- "F532 Median - B532"
> 
>         # This defines the column name of the mean Cy3 background intensites
>         Cy3b <- "B532 Mean"
> 
>         # Read the targets file (see limmaUsersGuide for info on how to
>         create this)
>         targets <- readTargets("targets.csv")
> 
>         # read values from gpr file
>         # because there is no red channel, read Rb & R to be the same
>         values as G
>         and Gb
>         # this is a type of hack that allows the function to work
>         RG <- read.maimages( targets$FileName,
>         source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))
> 
>         # remove the extraneous values red channel values
>         RG$R <- NULL
>         RG$Rb <- NULL
> 
>         # this line of code will remove any negative values which cause
>         errors
>         further on
>         RG$G[RG$G<0] <- 0
> 
>         # create a pData for the data just read
>         # this indicates which population each array belongs to
>         # a are "her-" and b are "her+" (almost certain)
>         pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b',
>         'b'))
>         rownames(pData) <-  RG$targets$FileName
> 
>         # create design matrix
>         design <- model.matrix(~factor(pData$population))
> 
>         # In my .gpr files all miRNAs contain the string "-miR-" or
>         "-let-" in their
>         name
>         # so the grep function can be used to extract all of these, removing
>         # all control signals and printing buffers etc.
>         # You need to check your .gpr files to find which signals you should
>         extract.
>         miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-",
>         RG$genes$Name))
>         RG.final <- RG[miRs, ]
> 
>         # load vsn library, this contains the normalization functions
>         library('vsn')
> 
>         # Do VSN normalization and output as vns object
>         mat <- vsnMatrix(RG.final$G)
> 
>         # insert rownames into matrix with normalized data
>         # this will mean that the gene names will appear in our final output
>         rownames(mat at hx) <- RG.final$genes$Name
> 
>         # my .gpr files contain 4 "within-array replicates" of each miRNA.
>         # We need to make full use of this information by calculating
>         the duplicate
>         correlation
>         # in order to use the duplicateCorrelation() function below,
>         # we need to make sure that the replicates of each gene appear
>         in sequence
>         in this matrix,
>         # so we sort the normalized values so the replicate groups of 4
>         miRs appear
>         in sequence
>         mat at hx <- mat at hx[order(rownames(mat at hx)), ]
> 
>         # calculate duplicate correlation between the 4 replicates on
>         each array
>         corfit <- duplicateCorrelation(mat, design, ndups=4)
> 
>         # fit the linear model, including info on duplicates and correlation
>         fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)
> 
>         # calculate values using ebayes
>         ebayes <- eBayes(fit)
> 
>         # output a list of top differnetially expressed genes...
>         topTable(ebayes, coef = 2, adjust = "BH", n = 10)
> 
> 
>         -Paul.
> 
> 
>         On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at wehi.edu.au
>         <mailto:shi at wehi.edu.au>> wrote:
> 
>             Dear Paul:
> 
>              The low p-values you have got are not surprising to me. I
>             have got even
>             lower FDR adjusted p-values than yours. This just means you got
>             significantly differentially expressed genes. But on the
>             other hand, I did
>             see much higher adjusted p-values in some of my microarray
>             analyses. In that
>             case, I will explore the data in more depth such as looking
>             at batch effect
>             etc.
> 
>             Cheers,
>             Wei
> 
> 
>             Paul Geeleher wrote:
> 
>                 Hi folks, I'm analyzing microRNA data using limma and
>                 I'm wondering about
>                 the validity of the p-values I'm getting out. Its a
>                 simple 'Group A Vs
>                 Group
>                 B' experimental design. 4 arrays in one group, 3 in the
>                 other and 4
>                 duplicate spots for each miRNA on each array.
> 
>                 The lowest adjusted p-values in the differential
>                 expression analysis are
>                 in
>                 the region of 10^-7.
> 
>                 Its been pointed out to me that plugging the point
>                 values from each sample
>                 into a regular t-test you get p=0.008, which then also
>                 needs to take the
>                 multiple test hit. Can anybody explain why limma is
>                 giving me such lower
>                 values and if they are valid?
> 
>                 I can provide more information if required.
> 
>                 Thanks,
> 
>                 Paul.
> 
> 
> 
> 
> 
> 
>     -- 
> 
>     Best wishes
>         Wolfgang
> 
>     -------------------------------------------------------
>     Wolfgang Huber
>     EMBL
>     http://www.embl.de/research/units/genome_biology/huber
>     -------------------------------------------------------
> 
> 
> 
> 
> -- 
> Paul Geeleher
> School of Mathematics, Statistics and Applied Mathematics
> National University of Ireland
> Galway
> Ireland
> 

-- 

Best wishes
      Wolfgang

-------------------------------------------------------
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber