[BioC] estimateDispersions in DESeq
Wolfgang Huber
whuber at embl.de
Fri Jun 22 00:29:03 CEST 2012
Dear Flavia
Thank you for your detailed and informative feedback!
Simon will be better able to address the question regarding what exactly
changed between the versions with respect to the different parameter
choices (sharingMode, fitType, method).
Two other comments:
1. The fact that you see such a strong dependence on these choices might
be a symptom of there being outliers (either whole samples being
outliers, or certain measurements in some samples). Outlier detection is
generally difficult to automate, yet outliers can have an excessive
impact on inference, especially with such small sample sizes. Did you
have a look at the 'pairs' plot between all 6 communities (on a log or
log-like scale)?
2. The dispersion-mean model was developed on RNA-Seq data. I have no
experience how well it fits to OTU counts in metagenomics. Your
observation on the plot may indicate a problem. If you don't mind, I'd
be interested in having a look at your data (you can anonymize the OTUs)
to see how well the dispersion-mean model used by DESeq fits that.
Best wishes
Wolfgang
-> More below. ->
Jun/21/12 2:50 PM, Flavia Nunes scripsit::
> Dear List,
>
> I am trying to use DESeq to analyse a dataset where we have samples of 3
> healthy and 3 diseased microbial communities, and we are trying to
> establish which OTUs are significantly more or less abundant in the healthy
> vs diseased samples.
>
> I tried running the new version of DESeq (1.8.3) on both a Mac and a PC
> running the latest version of R (2.15). Both versions give a strange
> result, where all OTUs have a padj value that is >0.7. I found this to be
> strange, because when looking at the raw count data, it is obvious that
> some OTUs are abundant in the one treatment (say, high counts in all of the
> heathy samples) and absent in the other (0 or close to 0 on all of the
> diseased samples).
Due to the variance-sharing of 'genes' with similar mean counts, this
could easily happen if there are *other* OTU with similar counts, but
highly discordant values within groups, forcing a high dispersion
estimate, and costing power even for those OTUs that you mention.
To avoid the dispersion-mean model, if you are willing to make the jump:
with 3 vs 3 samples you're in a place where sharingMode =
"gene-est-only" might already be useful - but I would consider the above
suggestions first.
>
> I asked a colleague to help me with the analysis, and he ran the analysis
> on an older version of DESeq (1.4), using the estimateVarianceFunction
> command instead of estimateDispersions. We saw that in the help file for
> estimateDispersions, that by using the sharingMode="fit-only",
> fitType="local" options, we should be able to get the same result as the
> estimateVarianceFunction. However, this is not the case. DESeq 1.4 was
> able to find 54 OTUs that were significantly different from healthy vs
> diseased samples, while DESeq 1.8.3 found that none of the OTUs were
> significantly different in healthy vs diseased.
>
> In a second attempt, we used the option method="per-condition" and this
> worked - I got the same 54 significant p-values as in the analysis with
> DESeq 1.4 But when I continued the analysis on other datasets (we have a
> number of different conditions), I again started to get odd p-values, such
> as 1.00 for every OTU. I changed the setting for the estimateDispersions
> command, trying different methods, and each time I would get a different
> set of p-values, but usually very high numbers, close to 1.
>
> It seems to me that the results are really sensitive to the method used to
> estimate dispersions, and I was wondering what are the properties of the
> data that I might have to look for in order to select the best method.
>
> Another unusual thing that I have noticed is that when I plot the
> Dispersion Estimates, the fit line deviates from the points towards the
> right side of the graph. This suggests to me that there must be something
> wrong with the fit estimate, but I do not know how I might be able to
> change the settings to get a better fit.
>
> I wanted to know if anyone on the list has come across a similar problem?
>
> I am using the commands below in DESeq. I can provide files of the data,
> as well as the results that I am receiving to anyone that might be
> interested in taking a closer look.
>
> WBDCountTable <- read.table( file.choose(), header=TRUE, row.names=1 )
> WBDDesign <- data.frame(row.names = colnames( WBDCountTable ), condition =
> c( "D1", "D2", "D3", "H1", "H2", "H3"), libType = c( "single-end",
> "single-end", "single-end", "single-end", "single-end", "single-end" ) )
> conds <- factor( c( "D", "D", "D", "H", "H", "H" ) )
> cds <- newCountDataSet( WBDCountTable, conds )
> cds <- estimateSizeFactors( cds )
> cds <- estimateDispersions( cds, method="per-condition", fitType="local" )
>
> plotDispEsts <- function( cds )
> {
> plot(
> rowMeans( counts( cds ) ),
> fitInfo(cds)$perGeneDispEsts,
> pch = 16, cex=1, log="xy" )
> xg <- 10^seq( -.5, 5, length.out=300 )
> lines( xg, fitInfo(cds)$dispFun( xg ), col="red" , lwd=3)
> }
>
> res <- nbinomTest( cds, "D", "H" )
>
> plotDE <- function( res )
> plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col =
> ifelse( res$padj < .1, "red", "black" ) )
> plotDE( res )
>
> res
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list