[BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes
Yao,Hui
hyao at mdanderson.org
Mon Jun 25 17:51:01 CEST 2012
Dear DEXSeq authors and users,
I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts.
To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below.
> load("toBioConductor.RData")
> library(DEXSeq)
> ls()
[1] "testgene"
### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N.
### The samples were collected from two batches, So we need to adjust the batch effects in the following model.
> formu <- count ~ sample + (exon + from)*type
> testgene <- estimateDispersions(testgene, formula=formu)
> testgene <- fitDispersionFunction(testgene)
> f0 <- count~sample + from*exon + type
> f1 <- count~sample + from*exon + type*I(exon==exonID)
> testgene <- testForDEU(testgene,formula0=f0, formula1=f1)
> testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" )
> res <- DEUresultTable(testgene)
n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing.
> res["ENSG00000001497:011",]
geneID exonID dispersion pvalue padjust
ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712
meanBase log2fold(X/N)
ENSG00000001497:011 5.714405 -1.936449
We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X".
> dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG00000001497:011",],type=pData(testgene)$type)
> with(dtbl,tapply(counts,type,median))
X N
8.534784 2.064785
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
[4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DEXSeq_1.2.0 Biobase_2.14.0
loaded via a namespace (and not attached):
[1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14
[6] stringr_0.6 tools_2.14.0 XML_3.9-4
So, have I made any mistakes in the analysis?
Many thanks in advance,
Hui
Hui Yao, Ph.D.
Principal Statistical Analyst
MD Anderson Cancer Center
-------------- next part --------------
A non-text attachment was scrubbed...
Name: toBioConductor-plotDEXSeq.pdf
Type: application/pdf
Size: 18588 bytes
Desc: toBioConductor-plotDEXSeq.pdf
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120625/45c8527d/attachment.pdf>
More information about the Bioconductor
mailing list