[BioC] A few Q's on using DEXSeq with mucho data
Simon Anders
anders at embl.de
Thu Mar 8 18:59:00 CET 2012
Hi Steve
On 03/08/2012 04:08 PM, Steve Lianoglou wrote:
> Imagine, if you will, that I have data for 7-8 different conditions
> and I'd like to use DEXSeq to test for differential exon usage.
>
> Is it best to create an ExonCountSet with all of my data (with an
> appropriate design matrix that identifies which samples belong to
> which condition) and do the estimateDispersion step with that?
>
> My guess is yes, but I wanted to double check -- am I in danger of
> maybe flagging some genes/exons as non-testable if, for example, it is
> only expressed in 2 out of 8 conditions?
An exon is flagged as not testable if the sum over all samples of the
read counts mapped to it is less than 'minCounts' (default: 10). Hence,
adding additional samples will never reduce the number of testable exons.
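For illustration, here is a minimal sketch of that rule in plain R (not
DEXSeq's internal code), with a made-up count matrix:

  ## exons x samples count matrix, invented for illustration
  set.seed(1)
  counts <- matrix(rpois(6 * 8, lambda = 2), nrow = 6,
                   dimnames = list(paste0("exon", 1:6), paste0("sample", 1:8)))
  minCount <- 10
  ## an exon is testable if its counts summed over *all* samples reach
  ## minCount, so extra samples can only keep it testable or make it so
  testable <- rowSums(counts) >= minCount
  table(testable)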
> Assuming I should use all of my data to `estimateDispersions`, what if
> I only have one replicate for one of the 8 conditions, is it best to
> remove it? I'm guessing it wouldn't provide any meaningful information
> for dispersion estimation since it's only one observation for that
> condition.
The sample without replicates would be ignored in the dispersion
estimation anyway.
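If you want to check quickly which conditions contribute replication, a
plain-R sketch (with an invented condition factor) is:

  ## invented example design; replace with your actual sample annotation
  condition <- factor(c("A", "A", "A", "B", "B", "C", "C", "D"))
  table(condition)                       # "D" has a single sample
  names(which(table(condition) < 2))     # conditions without replicates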
> Lastly, when testing for differences in exon usage, assuming I've been
> using the data from all of my experiments up to this point, I don't
> see a way to specify which experiments I want to specifically test.
>
> If I run "the normal" DEXSeq analysis on my data, I end up with a
> DEUresultTable that looks like it has log2fold(x/y) values for all
> experiments against just one y. There is only one pvalue/padjust
> column, so I'm not sure what comparison that is for.
The p value is from a test against the null hypothesis that exon usage
does not depend on condition, i.e., that it is the same across all
conditions (similar to the F test in a standard ANOVA). If you want to
contrast two specific conditions, you should subset the samples to just
those conditions.
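As a rough sketch, assuming your ExonCountSet (call it 'ecs') keeps the
condition factor in pData(ecs)$condition and supports the usual
eSet-style column subsetting (the condition names "A" and "B" are made
up), this could look like:

  ## keep only the samples from the two conditions of interest
  sel   <- pData(ecs)$condition %in% c("A", "B")
  ecsAB <- ecs[, sel]
  pData(ecsAB)$condition <- droplevels(pData(ecsAB)$condition)
  ## then run the testing steps from the vignette on ecsAB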
This leaves the question of whether to subset before or after dispersion
estimation, which is what your follow-up post is concerned with, too:
> I had a thought for you that struck me right after I sent out this
> email, specifically regarding your "hunch" about using all the data to
> estimate and fit the dispersion for each "bin".
>
> Perhaps it's a good idea to use all the data to estimateDispersions
> (still quite curious about that), but it seems that running the
> subsequent `fitDispersionFunction` step against all of the data at
> once might not be a great idea.
>
> I say this because, looking through the code, it seems like the
> dispersion is fit for each exon/bin against the (normalized) mean
> expression of that bin across the entire dataset. So, if the
> expression of the gene (and exon) is quite variable across all
> conditions we have data for, when we go back and try to test
> differential exon usage for a specific condition 1 vs. condition 2
> case, I think we'd rather fit the dispersion against the mean
> expression of that bin for just the two conditions under test.
>
> Isn't that right?
If some of your conditions have much higher variability than the other
conditions, they will in fact cause you to lose power even in
comparisons which do not involve them. This would be an argument for
subsetting before dispersion estimation.
However, if the dispersions are roughly the same in all conditions, you
are better off estimating a pooled dispersion from all samples. This is
because the per-exon estimates are then based on more degrees of freedom
and have less sampling variance. Remember that DEXSeq uses the maximum
of the per-exon estimate and the fitted estimate in its test. This is to
avoid exons with truly high variability being tested with too low a
dispersion value, which would likely lead to false positives. However, a
high per-exon estimate can also arise for an exon that actually has low
true dispersion, simply because of the estimator's sampling noise. This
costs power, and estimating the dispersion from as many samples as
possible can considerably reduce this loss of power.
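To make the maximum rule concrete, here is a toy illustration in plain R
(invented numbers, not DEXSeq code):

  ## per-exon dispersion estimates and the corresponding values from the
  ## fitted mean-dispersion trend, both invented for illustration
  dispPerExon <- c(0.01, 0.40, 0.05)
  dispFitted  <- c(0.10, 0.10, 0.10)
  ## the value used in the test is the element-wise maximum, so a noisy,
  ## spuriously high per-exon estimate (exon 2) is kept, which is
  ## conservative but costs power if the high estimate is just noise
  pmax(dispPerExon, dispFitted)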
In classical OLS ANOVA, something like our maximum rule is not necessary
because the F test is informed about the precision of the variance
estimate via the denominator degrees of freedom. Nevertheless, the
effect is the same: an ANOVA procedure has more power if the variance is
estimated in a pooled fashion from all samples, and this is hence what
is usually done. (Furthermore, the standard F test would not even be
valid if the conditions had unequal variances.)
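As a small plain-R illustration of this pooling (invented data):

  ## four conditions, five samples each, simulated under the null
  set.seed(1)
  d   <- data.frame(cond = gl(4, 5), y = rnorm(20))
  fit <- lm(y ~ cond, data = d)
  ## one pooled residual mean square (16 denominator df) feeds the F test
  anova(fit)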
Hence, stick to pooling all samples, unless one of the conditions is
really bad, i.e., much more variable than the others.
Simon