[BioC] Using limma to identify differentially expressed genes
James W. MacDonald
jmacdon at uw.edu
Tue Apr 10 21:58:55 CEST 2012
Hi David,
On 4/10/2012 11:14 AM, David Westergaard wrote:
> Hello,
>
> I've been trying to use limma to identify genes from the following
> data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 -
> It's a simple control vs. disease experiment
>
>
> # SDRF downloaded from above page
> SDRF<- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsAsFactors=FALSE,sep="\t")
>
> # Looking to compare Family-history negativer versus Diabetis
> controls<- SDRF[grep("Control, Family History
> Neg",SDRF$Comment..Sample_source_name.),]
> disease<- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),]
> Batch<- rbind(controls,disease)
>
> # Read in CEL files
> mixture.batch<- ReadAffy(filenames=Batch$Array.Data.File)
>
> # Preprocess data
> mixture.processed<- expresso(mixture.batch, bgcorrect.method = "rma",
> normalize.method = "quantiles", pmcorrect.method = "pmonly",
> summary.method = "medianpolish")
>
> # Get data in matrix
> signals<- exprs(mixture.prepared)
> cl<- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0)
>
> # Do design matrix and fit
> design<- model.matrix(~factor(cl))
> fit<- lmFit(signals,design)
> fit<- eBayes(fit)
> topTable(fit2,coef=2)
>
> Which yields the following:
> ID logFC AveExpr t P.Value adj.P.Val B
> 7513 208004_at -0.323 5.43 -4.65 0.00191 0.999 -3.10
> 11225 211829_s_at 0.340 5.07 4.36 0.00278 0.999 -3.17
> 5950 206424_at -0.907 6.65 -4.15 0.00363 0.999 -3.23
> 1354 201826_s_at -0.447 8.37 -4.13 0.00374 0.999 -3.24
> 19782 220418_at 0.392 5.43 4.02 0.00431 0.999 -3.27
> 8889 209396_s_at 1.899 7.47 4.01 0.00437 0.999 -3.28
> 5005 205478_at -0.931 9.22 -3.94 0.00481 0.999 -3.30
> 9469 209983_s_at 0.412 5.72 3.92 0.00492 0.999 -3.31
> 2936 203409_at 0.506 6.93 3.87 0.00531 0.999 -3.32
> 5054 205527_s_at 0.331 6.80 3.84 0.00549 0.999 -3.33
>
> I'm abit puzzled over the adjusted P-values. Can it really be true
> that ALL of the adjusted P-values are 0.999, or did I make a rookie
> mistake somewhere?
Well, there definitely is some weirdness going on. To answer your
question, yes it is possible for all the adjusted p-values to be right
close to 1, if there isn't any evidence for differential expression.
There are any number of reasons this can arise, but the main culprits in
my experience are usually very noisy intra-group data with hardly any
inter-group differences. A PCA plot can often shed light on this sort of
problem.
Now to get to the weirdness. You are getting these data by hand, when it
might be easier to get from within R, using GEOquery. Note here that I
have snipped extraneous output from my R session.
> library(GEOquery)
> geo <- getGEO("GSE21340")
> dat <- getGEOSuppFiles("GSE21340")
> setwd("GSE21340")
> system("tar xvf GSE21340_RAW.tar")
> library(affy)
> eset <- rma(ReadAffy())
> design <- model.matrix(~pData(geo[[1]])$source_name_ch1)
And just to note that I am using the correct data for the design matrix:
> levels(pData(geo[[1]])$source_name_ch1)
[1] "Control, Family History Neg" "Control, Family History Pos"
[3] "Replicate from 1 patient" "Type 2 DM"
> fit <- lmFit(eset, design)
> fit <- eBayes(fit)
> topTable(fit, 4)[,c(1,6)]
ID adj.P.Val
2454 M19483_at 0.005571008
204 D10523_at 0.005571008
2126 L38941_at 0.009749608
2299 M11717_rna1_at 0.015951421
2574 M25077_at 0.015951421
193 D00760_at 0.022937490
824 D87002_cds2_at 0.022937490
3079 M71243_f_at 0.027533357
1467 J02902_at 0.027533357
4802 U59423_at 0.028686408
Sorry about the alignment, but you see here that I do get some
differentially expressed genes, but not many. This is probably due in
part to the fact that I am using all four sample types in this analysis,
so the denominator in any contrast for me will be based on the sums of
squares of error for all four samples, whereas you are using only two.
This will increase the power to detect differentially expressed genes to
some extent.
However, note the probeset IDs, as well as the annotation of my eset object:
> annotation(eset)
[1] "hu6800"
If I search for the top probeset ID from your topTable, it appears to be
from a hgu133a or hgfocus array, not a hu6800 array (which is also known
as a HuGeneFl array). So it looks like your expression data might not be
from this experiment.
Best,
Jim
>
> Best Regards,
> David Westergaard
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list