[BioC] edgeR: Specifying the "coef"-argument in glmLRT in a two factor study
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Mar 31 01:07:43 CEST 2012
Dear Henning,
Thanks for including reproducible code. I think however that you might be
taking comments that were made in the edgeR User's Guide about an additive
model and trying to apply them to an interaction model. Let me clarify.
Your experiment has three genotypes and one treatment (stress vs control).
When we analyse a two factor study, the first question we have to
consider is whether it makes sense to assume that the treatment effect
will be the same for each genotype. If the answer is yes, then you would
fit an additive linear model:
design <- model.matrix(~ genotype + treatment)
This model has just four coefficients. The 4th coefficient would indeed
test for a treatment effect irrespective of genotype. This type of
analysis is done in the last two case studies of the edgeR User's Guide:
http://bioconductor.org/packages/2.10/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
This type of analysis is appropriate when the non-treatment factor is a
blocking or nuisance variable that is not of primary scientific interest,
for example a batch effect.
In your hypothesized study, however, your non-treatment factor is
genotype. Usually one would design a study of this type to see whether
the treatment effect differs between genotypes. So you need to fit a
model that allows the treatment effects to be different between genotypes,
for example
Model 1: ~ genotype * treatment
which is equivalent to ~genotype+treatment+genotype:treatment, or
Model 2: ~ genotype + genotype:treatment
These models are all equivalent, except for parametrization, and all have
six coefficients. With these model you cannot test for "treatment effect
irrespective of genotype", because there is no such thing. You are making
the assumption that the treatment effect may depends on genotype, hence it
makes no scientific sense to talk about a treatment effect without regard
to the genotype it applies to.
If you want to know which genes have a treatment effect in genotype A, you
would fit Model 2 and test for coef=4. For treatment effect in genotype
B, test for coef=5, and for treatment effect in genotype C, test for
coef=6.
If you want to know which genes have treatment effects that differ between
the genotypes, then you would test for interaction. You do this by
fitting Model 1 and testing for "coef=4:6". Not that this is one test,
not three tests.
Best wishes
Gordon
> Date: Wed, 28 Mar 2012 15:11:59 +0200
> From: "Henning Wildhagen" <HWildhagen at gmx.de>
> To: bioconductor at r-project.org
> Subject: [BioC] edgeR: Specifying the "coef"-argument in glmLRT in a
> two factor study
>
> Hi all,
>
> i am analysing a two-factorial RNA-seq experiment using edgeR. The design of my study has two factors, genotype and treatment.
> Genotype has three levels (A,B,C), "treatment" has two levels ("control", "stress").
> I have some questions regarding the usage of the "coef" and "contrast"-argument using the glmLRT()-function as outlined below:
>
> #####################################################################
> BEGIN CODE
> library(edgeR)
> # Generate a dataframe of neg.binom counts
> set.seed(50)
> s <- as.data.frame(matrix(rnbinom(3000,mu=20,size=20),ncol=3))
> u <- as.data.frame(matrix(rnbinom(3000,mu=30,size=25),ncol=3))
> v <- as.data.frame(matrix(rnbinom(3000,mu=40,size=20),ncol=3))
> w <- as.data.frame(matrix(rnbinom(3000,mu=50,size=25),ncol=3))
> x <- as.data.frame(matrix(rnbinom(3000,mu=60,size=20),ncol=3))
> y <- as.data.frame(matrix(rnbinom(3000,mu=70,size=25),ncol=3))
> z <- as.data.frame(cbind(s,u,v,w,x,y))
> names(z) <- c("A_Con1", "A_Con2","A_Con3","A_Str1", "A_Str2","A_Str3",
> "B_Con1","B_Con2","B_Con3","B_Str1","B_Str2","B_Str3",
> "C_Con1","C_Con2","C_Con3","C_Str1","C_Str2","C_Str3")
>
> # Specify factors
> genotype <- as.factor(c(rep("A",6),rep("B",6),rep("C",6)))
> treatment <-as.factor(rep(c(rep("Control",3),rep("Stress",3)),3))
>
> # DGEList-object
> groups <- c("A_Con", "A_Con","A_Con","A_Str", "A_Str","A_Str",
> "B_Con","B_Con","B_Con","B_Str","B_Str","B_Str",
> "C_Con","C_Con","C_Con","C_Str","C_Str","C_Str")
>
> tmp1 <- DGEList(counts=z,group=groups)
> tmp1 <- calcNormFactors(tmp1)
>
> # Design matrix
> design <- model.matrix(~genotype*treatment)
>
> # Estimating Cox-Reid tagwise dispersion
> tmp2 <- getPriorN(tmp1, design=design,prior.df=20)
> tmp1 <- estimateTagwiseDisp(tmp1,prior.n=tmp2,trend="movingave",prop.used=0.3,grid.length=500)
> summary(tmp1$tagwise.dispersion)
> tmp1$common.disp
>
> # Calling DE genes with the full model
> glmfit.tmp1 <- glmFit(tmp1,design=design,dispersion=tmp1$tagwise.dispersion,method="auto")
>
> # Testing for the effect of "treatment"
> lrt.tmp1.treat <- glmLRT(tmp1,glmfit.tmp1,coef=4) #
> treat <- topTags(lrt.tmp1.treat,n=nrow(lrt.tmp1.treat))
> treat1 <- treat[[1]]$adj.P.Val < 0.5
> table(treat1)
> # 2 DE genes
>
> END CODE
> ##################################################################
>
> According to the vignette and previous posts on this issue, specifying coef=4 should test for the difference between
> control and stress samples, irrespective of the level of the second factor.
> However, since in this design matrix, the reference cell is "genotypeA-treatmentControl", i have the impression that specifying
> coef=4 in the glmLRT()-function tests for the effect of treatment specifically on genotypeA rather than for the treatment
> effect irrespective of "genotype".
>> From working through the package "contrast", i conclude that for testing for the effect of treatment averaged over genotype,
> i need to specify a contrast as follows:
>
> ### BEGIN CODE
> lrt.tmp1.trt.av <- glmLRT(tmp1,glmfit.tmp1,contrast=c(0,0,0,1,1/3,1/3))
> treat.av <- topTags(lrt.tmp1.trt.av,n=nrow(lrt.tmp1.trt.av))
> treat.av1 <- treat.av[[1]]$adj.P.Val < 0.5
> table(treat.av1)
> # 9 DE genes
> sessionInfo()
> #R version 2.14.0 (2011-10-31)
> #Platform: x86_64-pc-linux-gnu (64-bit)
> #
> #locale:
> # [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
> # [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
> # [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
> # [7] LC_PAPER=de_DE.UTF-8 LC_NAME=de_DE.UTF-8
> # [9] LC_ADDRESS=de_DE.UTF-8 LC_TELEPHONE=de_DE.UTF-8
> #[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=de_DE.UTF-8
> #
> #attached base packages:
> #[1] stats graphics grDevices utils datasets methods base
> #
> #other attached packages:
> #[1] edgeR_2.4.3 limma_3.10.3 JGR_1.7-9 iplots_1.1-4 JavaGD_0.5-5
> #[6] rJava_0.9-3
> #
> #loaded via a namespace (and not attached):
> #[1] tools_2.14.0
> ### END CODE
>
> So, my question is, do i really test for the treatment effect irrespective of genotype by specifiying "coef=4" (=contrast=c(0,0,0,1,0,0)) or rather
> for the effect of treatment for the reference genotype ("A"). In the latter case, what would then be the correct approach to test
> for the effect of treatment irrespective of genotype?
> In either case, did i get it right that the obtained p-Values indicate the significance of the effect rather than for the test that the coefficients
> differ from zero?
>
> Thanks for your help, Henning
>
>
> ----------------------------------------------
> Dr. Henning Wildhagen
> Chair of Tree Physiology, University of Freiburg
> Germany
> --
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list