[BioC] A few Q's on using DEXSeq with mucho data
Simon Anders
anders at embl.de
Thu Mar 8 18:59:00 CET 2012
Hi Steve
On 03/08/2012 04:08 PM, Steve Lianoglou wrote:
> Imagine, if you will, that I have data for 7-8 different conditions
> and I'd like to use DEXSeq to test for differential exon usage.
>
> Is it best to create an ExonCountSet with all of my data (with an
> appropriate design matrix that identifies which samples belong to
> which condition) and do the estimateDispersion step with that?
>
> My guess is yes, but I wanted to double check -- am I in danger of
> maybe flagging some genes/exons as non-testable if, for example, it is
> only expressed in 2 out of 8 conditions?
An exon is flagged as not testable if the sum over all samples of the
read counts mapped to it is less than 'minCounts' (default: 10). Hence,
adding additional samples will never reduce the number of testable exons.
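For illustration, here is a minimal sketch of that rule in plain R (not
DEXSeq's internal code), with a made-up count matrix:

  ## exons x samples count matrix, invented for illustration
  set.seed(1)
  counts <- matrix(rpois(6 * 8, lambda = 2), nrow = 6,
                   dimnames = list(paste0("exon", 1:6), paste0("sample", 1:8)))
  minCount <- 10
  ## an exon is testable if its counts summed over *all* samples reach
  ## minCount, so extra samples can only keep it testable or make it so
  testable <- rowSums(counts) >= minCount
  table(testable)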
> Assuming I should use all of my data to `estimateDispersions`, what if
> I only have one replicate for one of the 8 conditions, is it best to
> remove it? I'm guessing it wouldn't provide any meaningful information
> for dispersion estimation since it's only one observation for that
> condition.
The sample without replicates would be ignored in the dispersion
estimation anyway.
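If you want to check quickly which conditions contribute replication, a
plain-R sketch (with an invented condition factor) is:

  ## invented example design; replace with your actual sample annotation
  condition <- factor(c("A", "A", "A", "B", "B", "C", "C", "D"))
  table(condition)                       # "D" has a single sample
  names(which(table(condition) < 2))     # conditions without replicates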
> Lastly, when testing for differences in exon usage, assuming I've been
> using the data from all of my experiments up to this point, I don't
> see a way to specify which experiments I want to specifically test.
>
> If I run "the normal" DEXSeq analysis on my data, I end up with a
> DEUresultTable that looks like it has log2fold(x/y) values for all
> experiments against just one y. There is only one pvalue/padjust
> column, so I'm not sure what comparison that is for.
The p value is from a test against the null hypothesis that exon usage
does not depend on condition, i.e., that it is the same across all
conditions (similar to the F test in a standard ANOVA). If you want to
contrast two specific conditions, you should subset the samples to just
those conditions.
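As a rough sketch, assuming your ExonCountSet (call it 'ecs') keeps the
condition factor in pData(ecs)$condition and supports the usual
eSet-style column subsetting (the condition names "A" and "B" are made
up), this could look like:

  ## keep only the samples from the two conditions of interest
  sel   <- pData(ecs)$condition %in% c("A", "B")
  ecsAB <- ecs[, sel]
  pData(ecsAB)$condition <- droplevels(pData(ecsAB)$condition)
  ## then run the testing steps from the vignette on ecsAB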
This leaves the question of whether to subset before or after dispersion
estimation, which is what your follow-up post is concerned with, too:
> I had a thought for you that struck me right after I sent out this
> email, specifically regarding your "hunch" about using all the data to
> estimate and fit the dispersion for each "bin".
>
> Perhaps it's a good idea to use all the data to estimateDispersions
> (still quite curious about that), but it seems that running the
> subsequent `fitDispersionFunction` step against all of the data at
> once might not be a great idea.
>
> I say this because, looking through the code, it seems like the
> dispersion is fit for each exon/bin against the (normalized) mean
> expression of that bin across the entire dataset. So, if the
> expression of the gene (and exon) is quite variable across all
> conditions we have data for, when we go back and try to test
> differential exon usage for a specific condition 1 vs. condition 2
> case, I think we'd rather fit the dispersion against the mean
> expression of that bin for just the two conditions under test.
>
> Isn't that right?
If some of your conditions have much higher variability than the other
conditions, they will in fact cause you to lose power even in
comparisons which do not involve them. This would be an argument for
subsetting before dispersion estimation.
However, if the dispersions are roughly the same in all conditions, you
are better off estimating a pooled dispersion from all samples. This is
because the per-exon estimates are then based on more degrees of freedom
and have less sampling variance. Remember that DEXSeq uses the maximum
of the per-exon estimate and the fitted estimate in its test. This is to
avoid exons with truly high variability being tested with too low a
dispersion value, which would likely lead to false positives. However, a
high per-exon estimate can also arise for an exon that actually has low
true dispersion, simply because of the estimator's sampling noise. This
costs power, and estimating the dispersion from as many samples as
possible can considerably reduce this loss of power.
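To make the maximum rule concrete, here is a toy illustration in plain R
(invented numbers, not DEXSeq code):

  ## per-exon dispersion estimates and the corresponding values from the
  ## fitted mean-dispersion trend, both invented for illustration
  dispPerExon <- c(0.01, 0.40, 0.05)
  dispFitted  <- c(0.10, 0.10, 0.10)
  ## the value used in the test is the element-wise maximum, so a noisy,
  ## spuriously high per-exon estimate (exon 2) is kept, which is
  ## conservative but costs power if the high estimate is just noise
  pmax(dispPerExon, dispFitted)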
In classical OLS ANOVA, something like our maximum rule is not necessary
because the F test is informed about the precision of the variance
estimate via the denominator degrees of freedom. Nevertheless, the
effect is the same: an ANOVA procedure has more power if the variance is
estimated in a pooled fashion from all samples, and this is hence what
is usually done. (Furthermore, the standard F test would not even be
valid if the conditions had unequal variances.)
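As a small plain-R illustration of this pooling (invented data):

  ## four conditions, five samples each, simulated under the null
  set.seed(1)
  d   <- data.frame(cond = gl(4, 5), y = rnorm(20))
  fit <- lm(y ~ cond, data = d)
  ## one pooled residual mean square (16 denominator df) feeds the F test
  anova(fit)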
Hence, stick to pooling all samples, unless one of the conditions is
really bad, i.e., much more variable than the others.
Simon