[BioC] Problem fitting a linear model using limma
James W. MacDonald
jmacdon at uw.edu
Wed Mar 7 21:07:43 CET 2012
Hi Joaquin,
On 3/7/2012 2:12 PM, Joaquin Martinez wrote:
> Hi Jim,
>
> Thank you for your quick reply and advice. Actually we had considered
> doing exactly what you suggest and remove the numbers from the sample
> names. My concern then is what is then the reference or baseline?,
> i.e. Puninf.1 or an average value of Puninf.1, Puninf.2 and Puninf.3.
> Also, how does removing numbers take into consideration slight value
> differences among biological replicate samples?
The baseline is the average of the Puninf samples. Which is, I believe,
what you want. To make the comparisons you request, limma is going to
compute a t-statistic, which simply put is
difference in means between two samples/ measure of the variability of
the means
So the numerator of the statistic tells you how different the two
samples are. The denominator (which is a measure of the slight
differences among biological replicates) is used to determine if the
differences between samples is 'large' or not. If there is a lot of
variability within sample types, the numerator has to be pretty big.
Conversely, if there isn't much variability within sample types then a
smaller numerator may be considered significantly large.
Note that you aren't forced to use Puninf as the reference. That's just
the easiest way to use modelMatrix(). You can also pass in a matrix for
the 'param' argument, selecting direct comparisons to make. Or you can
just leave Puninf as the reference, and make further comparisons with a
contrast matrix, which is probably simpler.
Note here that the design matrix with Puninf as reference is computing
implicit comparisons (e.g., the M163 coefficient is really log2(M163) -
log2(Puninf)). So if you want to know if there is a significant
difference between M163 and M86, you can just add a contrast matrix that
looks like
1
-1
0
0
Which simply means that we will be computing log2(M163) - log2(Puninf) -
(log2(M86) - log2(Puninf)). Algebraically, the two log2(Puninf) will
cancel out, and you end up with log2(M163) - log2(M86). You can create
any contrast matrix you want using makeContrasts().
>
> In the example we are following in the limma user's guide there are 3
> wild-type mice (wt1,wt2,wt3) and 3 mutant mice (mu1, mu2, mu3), where
> wt1 is used as the reference. I think we have the same type of study
> but with 6 "mouse types".
Unless I am mistaken, you are looking at the wrong example. The example
you reference involves technical replication, where RNA from each mouse
was hybridized to more than one slide. From what you said originally,
each of the samples in your experiment is a biological replicate. Your
experiment is more similar to the common reference design experiment on
p34 of the limma User's Guide.
Best,
Jim
>
> Thanks,
> Joaquin
>
>
>
>
> On Wed, Mar 7, 2012 at 12:01 PM, James W. MacDonald <jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>> wrote:
>
> Hi Joaquin,
>
>
> On 3/7/2012 11:08 AM, Joaquin Martinez wrote:
>
> Dear All,
>
> My colleague Ben and I have fitted a linear model to
> microarray data using
> the limma package but we get "coefficients not estimable" for
> the last two
> columns in the design matrix during the fit. As a consequence
> when trying
> to fit our contrast matrix we get the following error message,
> "Error in
> contrasts.fit(fit, contrasts = A) : trying to take contrast of
> non-estimable coefficient." We have discovered that if we
> simply reorder
> the design matrix columns and contrasts matrix rows, we
> encounter the same
> error. We would very much appreciate if someone could explain
> to us why the
> not-estimable coefficients arise, and how to correct the problem.
>
> We are following the guide (section 8.2) for limma 3.10.0
> using this
> version of R 2.14.0 (sessionInfo at end of email). We have 6
> different
> types of samples (Puninf, P86, P163, Muninf, M86, M163), and 3
> biological
> replicates of each (1,2,3; 4,5,6; 7,8,9; 10,11,12; 13,14,15;
> 16,17,18).
>
>
> Yes, but your targets file has numbers appended to the sample
> types, which makes R think they are different. Hypothetically you
> would have noticed this when using the modelMatrix() function:
>
>
> > modelMatrix(targets, ref = "Puninf.1")
> Found unique target names:
> M163.16 M163.17 M163.18 M86.13 M86.14 M86.15 Muninf.10 Muninf.11
> Muninf.12 P163.7 P163.8 P163.9 P86.4 P86.5 P86.6 Puninf.1 Puninf.2
> Puninf.3
>
> This indicates that limma thinks all of your replicates are
> different sample types. This won't work out (as you have already
> found) because the model will be over specified, which simply
> means that you cannot estimate all the parameters you have in the
> model. This is the same as saying you can't determine the values
> of x and y in the formula x + y = 3 (but if you also know that x
> - y = -1, then you can solve for both x and y).
>
> Now back to the problem at hand. If you remove the numbers from
> your sample types,
>
> targets$Cy3 <- sapply(strsplit(targets$Cy3, "\\."), function(x) x[1])
> targets$Cy5 <- sapply(strsplit(targets$Cy5, "\\."), function(x) x[1])
>
> then you will create the correct model matrix
>
> > modelMatrix(targets, ref="Puninf")
> Found unique target names:
> M163 M86 Muninf P163 P86 Puninf
> M163 M86 Muninf P163 P86
> [1,] 0 0 1 0 0
> [2,] 0 0 0 0 1
> [3,] 0 0 0 -1 0
> [4,] 0 -1 1 0 0
> [5,] 1 0 -1 0 0
> [6,] 0 1 0 0 -1
> [7,] -1 0 0 1 0
> [8,] 0 0 -1 0 0
> [9,] 0 0 0 0 -1
> [10,] 0 0 0 1 0
> [11,] 0 1 -1 0 0
> [12,] -1 0 1 0 0
> [13,] 0 -1 0 0 1
> [14,] 1 0 0 -1 0
> [15,] 0 0 1 0 0
> [16,] 0 0 0 0 1
> [17,] 0 0 0 -1 0
> [18,] 0 -1 1 0 0
> [19,] 1 0 -1 0 0
> [20,] 0 1 0 0 -1
> [21,] -1 0 0 1 0
>
> Which will compare all samples to Puninf.
>
> Best,
>
> Jim
>
>
>
>
>
>
> # The targets file looks like this
>
> targets<- structure(list(SlideNumber = c(29L, 30L, 31L, 32L,
> 33L, 34L,
> 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 51L, 46L,
> 47L,
> 48L, 50L), FileName = c("BE T22h slide 29", "BE T22h slide 30",
> "BE T22h slide 31", "BE T22h slide 32", "BE T22h slide 33",
> "BE T22h
> slide 34",
> "BE T22h slide 35", "BE T22h slide 36", "BE T22h slide 37",
> "BE T22h
> slide 38",
> "BE T22h slide 39", "BE T22h slide 40", "BE T22h slide 41",
> "BE T22h
> slide 42",
> "BE T22h slide 43", "BE T22h slide 44", "BE T22h slide 51",
> "BE T22h
> slide 46",
> "BE T22h slide 47", "BE T22h slide 48", "BE T22h slide 50"),
> Cy3 = c("Puninf.1", "Puninf.1", "P163.7", "M86.13",
> "Muninf.10",
> "P86.4", "M163.16", "Muninf.11", "P86.5", "Puninf.2",
> "Muninf.11",
> "M163.17", "M86.14", "P163.8", "Puninf.3", "Puninf.3",
> "P163.9",
> "M86.15", "Muninf.12", "P86.6", "M163.18"), Cy5 =
> c("Muninf.10",
> "P86.4", "Puninf.1", "Muninf.10", "M163.16", "M86.13",
> "P163.7",
> "Puninf.2", "Puninf.2", "P163.8", "M86.14", "Muninf.11",
> "P86.5", "M163.17", "Muninf.12", "P86.6", "Puninf.3",
> "Muninf.12",
> "M163.18", "M86.15", "P163.9")), .Names = c("SlideNumber",
> "FileName", "Cy3", "Cy5"), class = "data.frame", row.names
> = c(NA,
> -21L))
>
> # for readability we also paste it in here...
> #> targets
> # SlideNumber FileName Cy3 Cy5
> # 1 29 BE T22h slide 29 Puninf.1 Muninf.10
> # 2 30 BE T22h slide 30 Puninf.1 P86.4
> # 3 31 BE T22h slide 31 P163.7 Puninf.1
> # 4 32 BE T22h slide 32 M86.13 Muninf.10
> # 5 33 BE T22h slide 33 Muninf.10 M163.16
> # 6 34 BE T22h slide 34 P86.4 M86.13
> # 7 35 BE T22h slide 35 M163.16 P163.7
> # 8 36 BE T22h slide 36 Muninf.11 Puninf.2
> # 9 37 BE T22h slide 37 P86.5 Puninf.2
> # 10 38 BE T22h slide 38 Puninf.2 P163.8
> # 11 39 BE T22h slide 39 Muninf.11 M86.14
> # 12 40 BE T22h slide 40 M163.17 Muninf.11
> # 13 41 BE T22h slide 41 M86.14 P86.5
> # 14 42 BE T22h slide 42 P163.8 M163.17
> # 15 43 BE T22h slide 43 Puninf.3 Muninf.12
> # 16 44 BE T22h slide 44 Puninf.3 P86.6
> # 17 51 BE T22h slide 51 P163.9 Puninf.3
> # 18 46 BE T22h slide 46 M86.15 Muninf.12
> # 19 47 BE T22h slide 47 Muninf.12 M163.18
> # 20 48 BE T22h slide 48 P86.6 M86.15
> # 21 50 BE T22h slide 50 M163.18 P163.9
>
> # The design matrix is computed like this design<-
> modelMatrix(targets,
> ref = "Puninf.1")
> # but here is the structure if it is helpful
> design<- structure(c(0, 0, 0, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0,
> 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 1, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, -1, 0, 0, 0, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, -1,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, -1, 0, 1, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, -1,
> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
> 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, 1, 0, 0, 0, -1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> -1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0,
> 0, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 1, 0, 0, 0, 0), .Dim =
> c(21L,
> 17L), .Dimnames = list(c("slide 29", "slide 30", "slide 31",
> "slide 32", "slide 33", "slide 34", "slide 35", "slide 36",
> "slide 37",
> "slide 38", "slide 39", "slide 40", "slide 41", "slide 42",
> "slide 43",
> "slide 44", "slide 51", "slide 46", "slide 47", "slide 48",
> "slide 50"
> ), c("M163.16", "M163.17", "M163.18", "M86.13", "M86.14",
> "M86.15",
> "Muninf.10", "Muninf.11", "Muninf.12", "P163.7", "P163.8",
> "P163.9",
> "P86.4", "P86.5", "P86.6", "Puninf.2", "Puninf.3")))
>
>
> # Fit linear model - commented out becuase we can't really
> send all the
> data in MA
> # fit<- lmFit(MA, design = design)
> # Coefficients not estimable: Puninf.2 Puninf.3
>
> # Make a contrast matrix:
> A<-
> makeContrasts(PuninfvsMuninf=(Muninf.10+Muninf.11+Muninf.12-Puninf.2-Puninf.3)/3,
> levels = design)
>
> # and now fit the contrast
> con.fit<- contrasts.fit(fit, contrasts = A)
> Error in contrasts.fit(fit, contrasts = A) : trying to take
> contrast of
> non-estimable coefficient
>
> # now change the column order in the design matrix and row
> order of
> contrast matrix
> # we create a new sort index - arbitrarily placing the last
> two columns in
> the original
> # design matrix in the middle of the new one. Ditto for the
> rows of the
> contrast matrix.
> n<- nrow(A)
> newIndex<- c(1:5, n-1, n, 6:(n-2))
> designB<- design[,newIndex]
> B<- A[newIndex,]
> attributes(B)<- attributes(A)
>
> # create a new fit, but we get the same knd of error, the last
> two columns
> are
> # identified as non-estimable
> fitB<- lmFit(X$MA, design = designB)
> #Coefficients not estimable: P86.5 P86.6
> #Warning message:
> #Partial NA coefficients for 16278 probe(s)
>
>
> As far as we can tell, it's the trailing columns that get
> singled out, but
> we don't know why. Hopefully the information about is detailed
> enough, but
> we can provide more information if necessary to help
> troubleshooting.
>
> Thanks in advance,
>
> Joaquin
>
> sessionInfo()
>
> R version 2.14.0 (2011-10-31)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] limma_3.10.0
>
> loaded via a namespace (and not attached):
> [1] tools_2.14.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list