[BioC] edgeR: Specifying the "coef"-argument in glmLRT in a two factor study

Sat Mar 31 01:07:43 CEST 2012

Dear Henning,

Thanks for including reproducible code.  I think however that you might be 
taking comments that were made in the edgeR User's Guide about an additive 
model and trying to apply them to an interaction model.  Let me clarify.

Your experiment has three genotypes and one treatment (stress vs control). 
When we analyse a two factor study, the first question we have to 
consider is whether it makes sense to assume that the treatment effect
will be the same for each genotype.  If the answer is yes, then you would 
fit an additive linear model:

   design <- model.matrix(~ genotype + treatment)

This model has just four coefficients.  The 4th coefficient would indeed 
test for a treatment effect irrespective of genotype.  This type of 
analysis is done in the last two case studies of the edgeR User's Guide:

http://bioconductor.org/packages/2.10/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

This type of analysis is appropriate when the non-treatment factor is a 
blocking or nuisance variable that is not of primary scientific interest, 
for example a batch effect.

In your hypothesized study, however, your non-treatment factor is 
genotype.  Usually one would design a study of this type to see whether 
the treatment effect differs between genotypes.  So you need to fit a 
model that allows the treatment effects to be different between genotypes, 
for example

   Model 1:  ~ genotype * treatment

which is equivalent to ~genotype+treatment+genotype:treatment, or

   Model 2: ~ genotype + genotype:treatment

These models are all equivalent, except for parametrization, and all have 
six coefficients.  With these model you cannot test for "treatment effect 
irrespective of genotype", because there is no such thing.  You are making 
the assumption that the treatment effect may depends on genotype, hence it 
makes no scientific sense to talk about a treatment effect without regard 
to the genotype it applies to.

If you want to know which genes have a treatment effect in genotype A, you 
would fit Model 2 and test for coef=4.  For treatment effect in genotype 
B, test for coef=5, and for treatment effect in genotype C, test for 
coef=6.

If you want to know which genes have treatment effects that differ between 
the genotypes, then you would test for interaction.  You do this by 
fitting Model 1 and testing for "coef=4:6".  Not that this is one test, 
not three tests.

Best wishes
Gordon

> Date: Wed, 28 Mar 2012 15:11:59 +0200
> From: "Henning Wildhagen" <HWildhagen at gmx.de>
> To: bioconductor at r-project.org
> Subject: [BioC] edgeR: Specifying the "coef"-argument in glmLRT in a
> 	two	factor study
>
> Hi all,
>
> i am analysing a two-factorial RNA-seq experiment using edgeR. The design of my study has two factors, genotype and treatment.
> Genotype has three levels (A,B,C), "treatment" has two levels ("control", "stress").
> I have some questions regarding the usage of the "coef" and "contrast"-argument using the glmLRT()-function as outlined below:
>
> #####################################################################
> BEGIN CODE
> library(edgeR)
> # Generate a dataframe of neg.binom counts
> set.seed(50)
> s <- as.data.frame(matrix(rnbinom(3000,mu=20,size=20),ncol=3))
> u <- as.data.frame(matrix(rnbinom(3000,mu=30,size=25),ncol=3))
> v <- as.data.frame(matrix(rnbinom(3000,mu=40,size=20),ncol=3))
> w <- as.data.frame(matrix(rnbinom(3000,mu=50,size=25),ncol=3))
> x <- as.data.frame(matrix(rnbinom(3000,mu=60,size=20),ncol=3))
> y <- as.data.frame(matrix(rnbinom(3000,mu=70,size=25),ncol=3))
> z <- as.data.frame(cbind(s,u,v,w,x,y))
> names(z) <- c("A_Con1", "A_Con2","A_Con3","A_Str1", "A_Str2","A_Str3",
> "B_Con1","B_Con2","B_Con3","B_Str1","B_Str2","B_Str3",
> "C_Con1","C_Con2","C_Con3","C_Str1","C_Str2","C_Str3")
>
> # Specify factors
> genotype <- as.factor(c(rep("A",6),rep("B",6),rep("C",6)))
> treatment <-as.factor(rep(c(rep("Control",3),rep("Stress",3)),3))
>
> # DGEList-object
> groups <- c("A_Con", "A_Con","A_Con","A_Str", "A_Str","A_Str",
> "B_Con","B_Con","B_Con","B_Str","B_Str","B_Str",
> "C_Con","C_Con","C_Con","C_Str","C_Str","C_Str")
>
> tmp1 <- DGEList(counts=z,group=groups)
> tmp1 <- calcNormFactors(tmp1)
>
> # Design matrix
> design <- model.matrix(~genotype*treatment)
>
> # Estimating Cox-Reid tagwise dispersion
> tmp2 <- getPriorN(tmp1, design=design,prior.df=20)
> tmp1 <- estimateTagwiseDisp(tmp1,prior.n=tmp2,trend="movingave",prop.used=0.3,grid.length=500)
> summary(tmp1$tagwise.dispersion)
> tmp1$common.disp
>
> # Calling DE genes with the full model
> glmfit.tmp1 <- glmFit(tmp1,design=design,dispersion=tmp1$tagwise.dispersion,method="auto")
>
> # Testing for the effect of "treatment"
> lrt.tmp1.treat <- glmLRT(tmp1,glmfit.tmp1,coef=4) #
> treat <- topTags(lrt.tmp1.treat,n=nrow(lrt.tmp1.treat))
> treat1 <- treat[[1]]$adj.P.Val < 0.5
> table(treat1)
> # 2 DE genes
>
> END CODE
> ##################################################################
>
> According to the vignette and previous posts on this issue, specifying coef=4 should test for the difference between
> control and stress samples, irrespective of the level of the second factor.
> However, since in this design matrix, the reference cell is "genotypeA-treatmentControl", i have the impression that specifying
> coef=4 in the glmLRT()-function tests for the effect of treatment specifically on genotypeA rather than for the treatment
> effect irrespective of "genotype".
>> From working through the package "contrast", i conclude that for testing for the effect of treatment averaged over genotype,
> i need to specify a contrast as follows:
>
> ### BEGIN CODE
> lrt.tmp1.trt.av <- glmLRT(tmp1,glmfit.tmp1,contrast=c(0,0,0,1,1/3,1/3))
> treat.av <- topTags(lrt.tmp1.trt.av,n=nrow(lrt.tmp1.trt.av))
> treat.av1 <- treat.av[[1]]$adj.P.Val < 0.5
> table(treat.av1)
> # 9 DE genes
> sessionInfo()
> #R version 2.14.0 (2011-10-31)
> #Platform: x86_64-pc-linux-gnu (64-bit)
> #
> #locale:
> # [1] LC_CTYPE=de_DE.UTF-8          LC_NUMERIC=C
> # [3] LC_TIME=de_DE.UTF-8           LC_COLLATE=de_DE.UTF-8
> # [5] LC_MONETARY=de_DE.UTF-8       LC_MESSAGES=de_DE.UTF-8
> # [7] LC_PAPER=de_DE.UTF-8          LC_NAME=de_DE.UTF-8
> # [9] LC_ADDRESS=de_DE.UTF-8        LC_TELEPHONE=de_DE.UTF-8
> #[11] LC_MEASUREMENT=de_DE.UTF-8    LC_IDENTIFICATION=de_DE.UTF-8
> #
> #attached base packages:
> #[1] stats     graphics  grDevices utils     datasets  methods   base
> #
> #other attached packages:
> #[1] edgeR_2.4.3  limma_3.10.3 JGR_1.7-9    iplots_1.1-4 JavaGD_0.5-5
> #[6] rJava_0.9-3
> #
> #loaded via a namespace (and not attached):
> #[1] tools_2.14.0
> ### END CODE
>
> So, my question is, do i really test for the treatment effect irrespective of genotype by specifiying "coef=4" (=contrast=c(0,0,0,1,0,0)) or rather
> for the effect of treatment for the reference genotype ("A"). In the latter case, what would then be the correct approach to test
> for the effect of treatment irrespective of genotype?
> In either case, did i get it right that the obtained p-Values indicate the significance of the effect rather than for the test that the coefficients
> differ from zero?
>
> Thanks for your help, Henning
>
>
> ----------------------------------------------
> Dr. Henning Wildhagen
> Chair of Tree Physiology, University of Freiburg
> Germany
> --

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}