[BioC] Very low P-values in limma

Paul Geeleher paulgeeleher at gmail.com
Fri Oct 23 18:07:02 CEST 2009


Hi Claus,

corfit$consensus = .81 which I guess is pretty close to 1.

The only differences in the code are that with the
duplicateCorrelation method I finish with:

corfit <- duplicateCorrelation(mat, design, ndups=4)
fit <- lmFit(mat at hx, design, ndups=4, correlation=corfit$consensus)
ebayes <- eBayes(fit)
topTable(ebayes, coef = 2, adjust = "BH", n = 10)

The huge discrepancy in p-values is strange indeed.

Paul.

On Fri, Oct 23, 2009 at 4:25 PM, Mayer, Claus-Dieter <c.mayer at abdn.ac.uk> wrote:
> Dear Paul!
>
> What is the corfit$consensus value? If it is close to 1, using limma with duplicateCorrelation or averaging (or indeed just using any of the duplicates) should give you very similar results.
> If however corfit$consensus is close to 0, limma would indeed treat the duplicates similar to independent replicates. Obviously a corfit$consensus value close to 0 would be quite worrying as that would indicate a very high technical variability (or some subtle mistake in your code). Anyway, checking that value might help to solve the puzzle.
>
> Claus
>
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-
>> bounces at stat.math.ethz.ch] On Behalf Of Paul Geeleher
>> Sent: 23 October 2009 15:13
>> To: Gordon K Smyth
>> Cc: Bioconductor mailing list
>> Subject: Re: [BioC] Very low P-values in limma
>>
>> Dear list,
>>
>> I have a query, I've rerun my analysis by averaging out the within
>> array duplicates using the avedups() function instead of
>> duplicateCorrelation().
>>
>> aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL)
>> fit <- lmFit(aveMat, design)
>> ebayes <- eBayes(fit)
>> topTable(ebayes, coef = 2, adjust = "BH", n = 10)
>>
>> When I do this my most significant p-values drop from 10^-7 to .06!
>>
>> This seems dubious? It seems like to get values as low as 10^-7 the
>> duplicateCorrelation() function must be treating the within array
>> replicates in a similar way it would separate samples, this seems
>> suspicious to me...
>>
>> -Paul.
>>
>>
>>
>> On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>> > Dear Paul,
>> >
>> > I'm not quite sure why you're so shocked about getting significant
>> results.
>> >  Usually people complain to me when they don't get significance :).
>> >
>> > limma is able to obtain much higher significance levels than a t-test
>> > because it (i) leverages information from the within-array replicates
>> and
>> > (ii) borrows information across probes.  These two approaches in
>> combination
>> > increase the effective degrees of freedom dramatically.
>> >
>> > We would hardly go to so much time and trouble to develop new
>> statistical
>> > methods if they didn't improve on ordinary t-tests.
>> >
>> > If I were you, I'd be looking at the sizes of the fold changes, and
>> whether
>> > the results seem biologically sensible and agree with known information,
>> and
>> > so on.
>> >
>> > Best wishes
>> > Gordon
>> >
>> >
>> > ---------- original message -----------
>> > [BioC] Very low P-values in limma
>> > Paul Geeleher paulgeeleher at gmail.com
>> > Thu Oct 22 11:34:01 CEST 2009
>> >
>> > The value of df.prior is 3.208318. Hopefully somebody can help me here
>> > because I'm still really at a loss on why these values are so low.
>> >
>> > Here's the script. I'm fairly sure it conforms to the documentation:
>> >
>> > library(affy)
>> >
>> > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/')
>> >
>> > # create a color pallete for boxplots
>> > cols <- brewer.pal(12, "Set3")
>> >
>> > # This defines the column name of the mean Cy3 foreground intensites
>> > #Cy3 <- "F532 Median"
>> > # background corrected value
>> > Cy3 <- "F532 Median - B532"
>> >
>> > # This defines the column name of the mean Cy3 background intensites
>> > Cy3b <- "B532 Mean"
>> >
>> > # Read the targets file (see limmaUsersGuide for info on how to create
>> this)
>> > targets <- readTargets("targets.csv")
>> >
>> > # read values from gpr file
>> > # because there is no red channel, read Rb & R to be the same values as
>> G
>> > and Gb
>> > # this is a type of hack that allows the function to work
>> > RG <- read.maimages( targets$FileName,
>> > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))
>> >
>> > # remove the extraneous values red channel values
>> > RG$R <- NULL
>> > RG$Rb <- NULL
>> >
>> > # this line of code will remove any negative values which cause errors
>> > further on
>> > RG$G[RG$G<0] <- 0
>> >
>> > # create a pData for the data just read
>> > # this indicates which population each array belongs to
>> > # a are "her-" and b are "her+" (almost certain)
>> > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b'))
>> > rownames(pData) <-  RG$targets$FileName
>> >
>> > # create design matrix
>> > design <- model.matrix(~factor(pData$population))
>> >
>> > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in
>> their
>> > name
>> > # so the grep function can be used to extract all of these, removing
>> > # all control signals and printing buffers etc.
>> > # You need to check your .gpr files to find which signals you should
>> > extract.
>> > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name))
>> > RG.final <- RG[miRs, ]
>> >
>> > # load vsn library, this contains the normalization functions
>> > library('vsn')
>> >
>> > # Do VSN normalization and output as vns object
>> > mat <- vsnMatrix(RG.final$G)
>> >
>> > # insert rownames into matrix with normalized data
>> > # this will mean that the gene names will appear in our final output
>> > rownames(mat at hx) <- RG.final$genes$Name
>> >
>> > # my .gpr files contain 4 "within-array replicates" of each miRNA.
>> > # We need to make full use of this information by calculating the
>> duplicate
>> > correlation
>> > # in order to use the duplicateCorrelation() function below,
>> > # we need to make sure that the replicates of each gene appear in
>> sequence
>> > in this matrix,
>> > # so we sort the normalized values so the replicate groups of 4 miRs
>> appear
>> > in sequence
>> > mat at hx <- mat at hx[order(rownames(mat at hx)), ]
>> >
>> > # calculate duplicate correlation between the 4 replicates on each array
>> > corfit <- duplicateCorrelation(mat, design, ndups=4)
>> >
>> > # fit the linear model, including info on duplicates and correlation
>> > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)
>> >
>> > # calculate values using ebayes
>> > ebayes <- eBayes(fit)
>> >
>> > # output a list of top differnetially expressed genes...
>> > topTable(ebayes, coef = 2, adjust = "BH", n = 10)
>> >
>> >
>> > -Paul.
>> >
>> >
>> > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at wehi.edu.au> wrote:
>> >
>> >> Dear Paul:
>> >>
>> >>  The low p-values you have got are not surprising to me. I have got
>> >
>> > even
>> >>
>> >> lower FDR adjusted p-values than yours. This just means you got
>> >> significantly differentially expressed genes. But on the other hand, I
>> >
>> > did
>> >>
>> >> see much higher adjusted p-values in some of my microarray analyses. In
>> >
>> > that
>> >>
>> >> case, I will explore the data in more depth such as looking at batch
>> >
>> > effect
>> >>
>> >> etc.
>> >>
>> >> Cheers,
>> >> Wei
>> >>
>> >>
>> >> Paul Geeleher wrote:
>> >>
>> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering
>> >
>> > about
>> >>>
>> >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs
>> >>> Group
>> >>> B' experimental design. 4 arrays in one group, 3 in the other and 4
>> >>> duplicate spots for each miRNA on each array.
>> >>>
>> >>> The lowest adjusted p-values in the differential expression analysis
>> >
>> > are
>> >>>
>> >>> in
>> >>> the region of 10^-7.
>> >>>
>> >>> Its been pointed out to me that plugging the point values from each
>> >
>> > sample
>> >>>
>> >>> into a regular t-test you get p=0.008, which then also needs to take
>> >
>> > the
>> >>>
>> >>> multiple test hit. Can anybody explain why limma is giving me such
>> >
>> > lower
>> >>>
>> >>> values and if they are valid?
>> >>>
>> >>> I can provide more information if required.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Paul.
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>> > --
>> > Paul Geeleher
>> > School of Mathematics, Statistics and Applied Mathematics
>> > National University of Ireland
>> > Galway
>> > Ireland
>> >
>> >
>>
>>
>>
>> --
>> Paul Geeleher
>> School of Mathematics, Statistics and Applied Mathematics
>> National University of Ireland
>> Galway
>> Ireland
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> The University of Aberdeen is a charity registered in Scotland, No SC013683.
>



-- 
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland



More information about the Bioconductor mailing list